By: Syeda Ambreen Karim Bokhari
Buying and selling used smartphones used to happen on a handful of online marketplace sites, but the used and refurbished phone market has grown considerably over the past decade. An IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7 billion by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to rising demand for used smartphones, which offer considerable savings compared with new models. Refurbished and used devices remain cost-effective alternatives for both consumers and businesses looking to save money on a smartphone purchase, and the used smartphone market offers plenty of other benefits as well.
• brand_name: Name of manufacturing brand
• os: OS on which the phone runs
• screen_size: Size of the screen in cm
• 4g: Whether 4G is available or not
• 5g: Whether 5G is available or not
• main_camera_mp: Resolution of the rear camera in megapixels
• selfie_camera_mp: Resolution of the front camera in megapixels
• int_memory: Amount of internal memory (ROM) in GB
• ram: Amount of RAM in GB
• battery: Energy capacity of the phone battery in mAh
• weight: Weight of the phone in grams
• release_year: Year when the phone model was released
• days_used: Number of days the used/refurbished phone has been used
• new_price: Price of a new phone of the same model in euros
• used_price: Price of the used/refurbished phone in euros
We need to create an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones.
First, we explore the dataset and extract insights from the data to answer the following questions:
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build a linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
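Since the brief calls for an ML-based pricing model and the imports above already pull in `train_test_split`, `LinearRegression`, and the regression metrics, here is a minimal sketch of how those pieces fit together. It runs on a small synthetic frame (the real `used_phone_data.csv` is loaded below), with column names borrowed from the data dictionary; the coefficients and data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Toy stand-in for the real dataset; column names match the data dictionary,
# but the values and the 0.5 / -0.05 relationship are made up for illustration.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "new_price": rng.uniform(50, 800, 200),
    "days_used": rng.integers(90, 1100, 200),
})
toy["used_price"] = 0.5 * toy["new_price"] - 0.05 * toy["days_used"] + rng.normal(0, 5, 200)

# Hold out 30% of rows for evaluation, fit, and score.
X_train, X_test, y_train, y_test = train_test_split(
    toy[["new_price", "days_used"]], toy["used_price"], test_size=0.3, random_state=1
)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, pred):.2f}, R2: {r2_score(y_test, pred):.3f}")
```

The same split/fit/score pattern applies once the real features are cleaned and encoded.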
cell = pd.read_csv("used_phone_data.csv")
data = cell.copy()  # note the parentheses: cell.copy (without them) returns the bound method, not a copy
np.random.seed(1)
cell.sample(n=10)
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2501 | Samsung | Android | 13.49 | yes | no | 13.0 | 13.0 | 32.0 | 4.00 | 3600.0 | 181.0 | 2017 | 683 | 198.680 | 79.47 |
| 2782 | Sony | Android | 13.81 | yes | no | NaN | 8.0 | 32.0 | 4.00 | 3300.0 | 156.0 | 2019 | 195 | 198.150 | 149.10 |
| 605 | Others | Android | 12.70 | yes | no | 8.0 | 5.0 | 16.0 | 4.00 | 2400.0 | 137.0 | 2015 | 1048 | 161.470 | 48.39 |
| 2923 | Vivo | Android | 19.37 | yes | no | 13.0 | 16.0 | 64.0 | 4.00 | 3260.0 | 149.3 | 2019 | 375 | 211.880 | 138.31 |
| 941 | Others | Others | 5.72 | no | no | 0.3 | 0.3 | 32.0 | 0.25 | 820.0 | 90.0 | 2013 | 883 | 29.810 | 8.92 |
| 1833 | LG | Android | 13.49 | no | no | 8.0 | 1.3 | 32.0 | 4.00 | 3140.0 | 161.0 | 2013 | 670 | 240.540 | 96.18 |
| 671 | Apple | iOS | 14.92 | yes | no | 12.0 | 7.0 | 64.0 | 4.00 | 5493.0 | 48.0 | 2018 | 403 | 700.150 | 350.08 |
| 1796 | LG | Android | 17.78 | yes | no | 5.0 | 0.3 | 16.0 | 4.00 | 4000.0 | 294.8 | 2014 | 708 | 189.300 | 75.94 |
| 757 | Asus | Android | 13.49 | yes | no | 13.0 | 8.0 | 32.0 | 4.00 | 5000.0 | 181.0 | 2017 | 612 | 270.500 | 108.13 |
| 3528 | Realme | Android | 15.72 | yes | no | NaN | 16.0 | 64.0 | 4.00 | 4035.0 | 184.0 | 2019 | 433 | 159.885 | 80.00 |
cell.shape
(3571, 15)
cell.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3571 entries, 0 to 3570
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   brand_name        3571 non-null   object
 1   os                3571 non-null   object
 2   screen_size       3571 non-null   float64
 3   4g                3571 non-null   object
 4   5g                3571 non-null   object
 5   main_camera_mp    3391 non-null   float64
 6   selfie_camera_mp  3569 non-null   float64
 7   int_memory        3561 non-null   float64
 8   ram               3561 non-null   float64
 9   battery           3565 non-null   float64
 10  weight            3564 non-null   float64
 11  release_year      3571 non-null   int64
 12  days_used         3571 non-null   int64
 13  new_price         3571 non-null   float64
 14  used_price        3571 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 418.6+ KB
cell.isnull().sum()
brand_name            0
os                    0
screen_size           0
4g                    0
5g                    0
main_camera_mp      180
selfie_camera_mp      2
int_memory           10
ram                  10
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
dtype: int64
Observations:
There are missing values in the following columns: main_camera_mp (180), selfie_camera_mp (2), int_memory (10), ram (10), battery (6), and weight (7).
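Raw missing counts are easier to judge alongside the percentage of rows affected. A small self-contained sketch (the values here are hypothetical, not the notebook's `cell` DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same kind of gaps as the real dataset.
df = pd.DataFrame({
    "main_camera_mp": [13.0, np.nan, 8.0, np.nan],
    "ram": [4.0, 4.0, np.nan, 2.0],
    "battery": [3000.0, 4000.0, 3500.0, 2500.0],
})

# Count and percentage of missing values per column, side by side.
missing = pd.DataFrame({
    "n_missing": df.isnull().sum(),
    "pct_missing": (df.isnull().mean() * 100).round(2),
})
print(missing)
```

On the real data, `main_camera_mp`'s 180 missing values amount to about 5% of the 3571 rows, while the other columns are below 0.3%.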
# duplicate values check
cell.duplicated().sum()
0
# count of all variable data types
cell.dtypes.value_counts()
float64    9
object     4
int64      2
dtype: int64
# check for number of unique values in the variables
cell.nunique()
brand_name            34
os                     4
screen_size          127
4g                     2
5g                     2
main_camera_mp        44
selfie_camera_mp      37
int_memory            16
ram                   14
battery              354
weight               613
release_year           8
days_used            930
new_price           3099
used_price          3044
dtype: int64
# Let's look at the statistical summary of the data
pd.set_option(
"display.float_format", lambda x: "%.3f" % x
) # to display numbers rounded off to 3 decimal places
cell.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3571 | 4 | Android | 3246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 14.804 | 5.153 | 2.700 | 12.700 | 13.490 | 16.510 | 46.360 |
| 4g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3391.000 | NaN | NaN | NaN | 9.400 | 4.818 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3569.000 | NaN | NaN | NaN | 6.547 | 6.879 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3561.000 | NaN | NaN | NaN | 54.533 | 84.696 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3561.000 | NaN | NaN | NaN | 4.057 | 1.392 | 0.030 | 4.000 | 4.000 | 4.000 | 16.000 |
| battery | 3565.000 | NaN | NaN | NaN | 3067.226 | 1364.207 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3564.000 | NaN | NaN | NaN | 179.424 | 90.281 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
# generate an automated profiling report of the dataset
import pandas_profiling
pandas_profiling.ProfileReport(cell)
# looking at value counts for non-numeric features
num_to_display = 20 # defining this up here so it's easy to change later if I want
for colname in cell.dtypes[cell.dtypes == 'object'].index:
val_counts = cell[colname].value_counts(dropna=False) # i want to see NA counts
print(val_counts[:num_to_display])
if len(val_counts) > num_to_display:
print(f'Only displaying first {num_to_display} of {len(val_counts)} values.')
print('\n\n') # just for more space between
Others      509
Samsung     364
Huawei      264
LG          212
Lenovo      172
ZTE         141
Xiaomi      134
Oppo        129
Asus        126
Alcatel     125
Nokia       121
Micromax    120
Honor       118
Vivo        117
Motorola    110
HTC         110
Sony         88
Meizu        62
Apple        59
Gionee       56
Name: brand_name, dtype: int64
Only displaying first 20 of 34 values.

Android    3246
Others      202
Windows      67
iOS          56
Name: os, dtype: int64

yes    2359
no     1212
Name: 4g, dtype: int64

no     3419
yes     152
Name: 5g, dtype: int64
cell = cell.rename({"4g":"four_g","5g":"five_g"}, axis='columns')
cell.head()
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 23.970 | yes | no | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | 86.960 |
| 1 | Honor | Android | 28.100 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 161.490 |
| 2 | Honor | Android | 24.290 | yes | yes | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 268.550 |
| 3 | Honor | Android | 26.040 | yes | yes | 13.000 | 8.000 | 64.000 | 6.000 | 7250.000 | 480.000 | 2020 | 345 | 278.930 | 180.230 |
| 4 | Honor | Android | 15.720 | yes | no | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | 103.800 |
# Convert categorical variables to the category dtype.
cell['brand_name']=cell.brand_name.astype('category')
cell['os']=cell.os.astype('category')
cell['four_g']=cell.four_g.astype('category')
cell['five_g']=cell.five_g.astype('category')
#cell['release_year']=cell.release_year.astype('category')
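Converting to the category dtype mainly saves memory (each distinct label is stored once, with small integer codes per row) and speeds up groupbys. A quick sketch of the effect on a hypothetical low-cardinality column like `brand_name`:

```python
import pandas as pd

# Hypothetical low-cardinality column similar in shape to brand_name.
s_obj = pd.Series(["Samsung", "Apple", "Samsung", "LG"] * 1000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True counts the string storage too, so the comparison is fair;
# the category version is dramatically smaller.
print("object:  ", s_obj.memory_usage(deep=True))
print("category:", s_cat.memory_usage(deep=True))
```

With 34 brands over 3571 rows, the same effect applies to the real `brand_name` column.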
cell.sort_values(by='battery', ascending=False).head(10)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 387 | Samsung | Android | 43.660 | yes | no | 12.000 | 5.000 | 64.000 | 3.000 | 12000.000 | 23.000 | 2019 | 266 | 659.470 | 460.580 |
| 401 | Apple | Others | 31.910 | yes | no | 12.000 | 7.000 | 128.000 | 6.000 | 9720.000 | 641.000 | 2020 | 220 | 999.580 | 749.840 |
| 666 | Apple | iOS | 31.910 | yes | no | 12.000 | 7.000 | 1024.000 | 4.000 | 9720.000 | 631.000 | 2018 | 543 | 1100.860 | 550.400 |
| 1647 | Lenovo | Android | 25.560 | yes | no | 8.000 | 1.600 | 16.000 | 4.000 | 9600.000 | 619.000 | 2014 | 723 | 300.610 | 120.290 |
| 1646 | Lenovo | Android | 33.500 | yes | no | 8.000 | 1.600 | 32.000 | 4.000 | 9600.000 | 950.000 | 2014 | 1085 | 409.190 | 122.830 |
| 2691 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 732.000 | 2014 | 799 | 469.180 | 187.650 |
| 2690 | Samsung | Android | 30.800 | yes | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 740.000 | 2014 | 921 | 499.550 | 149.850 |
| 2689 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 750.000 | 2014 | 689 | 538.780 | 215.540 |
| 2688 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 753.000 | 2014 | 1078 | 641.130 | 192.390 |
| 2687 | Samsung | Android | 30.800 | yes | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 753.000 | 2014 | 1089 | 669.490 | 200.760 |
cell.sort_values(by='battery').head(10)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1383 | Huawei | Others | 3.180 | no | no | 12.500 | 5.000 | 16.000 | 0.250 | 80.000 | 35.000 | 2016 | 643 | 197.400 | 79.000 |
| 1323 | Huawei | Others | 2.860 | no | no | 13.000 | 8.000 | 128.000 | 0.250 | 178.000 | 32.500 | 2018 | 610 | 149.370 | 59.840 |
| 690 | Apple | iOS | 3.330 | no | no | 12.000 | 6.000 | 32.000 | 4.000 | 205.000 | 25.000 | 2016 | 811 | 271.230 | 81.280 |
| 700 | Apple | iOS | 3.330 | no | no | 8.000 | 1.200 | 32.000 | 4.000 | 205.000 | 40.000 | 2014 | 590 | 648.510 | 324.180 |
| 702 | Apple | iOS | 3.330 | no | no | 8.000 | 1.200 | 32.000 | 4.000 | 205.000 | 25.000 | 2014 | 541 | 247.700 | 123.840 |
| 520 | Alcatel | Others | 6.030 | no | no | 5.000 | 2.000 | 16.000 | 4.000 | 210.000 | 142.950 | 2015 | 554 | 110.970 | 55.630 |
| 2841 | Sony | Android | 3.490 | no | no | 8.100 | 2.000 | 16.000 | 4.000 | 225.000 | 122.500 | 2013 | 890 | 80.180 | 24.090 |
| 2449 | Samsung | Others | 2.700 | no | no | 12.000 | 9.000 | 16.000 | 4.000 | 230.000 | 25.000 | 2019 | 337 | 251.640 | 163.100 |
| 1700 | LG | Android | 2.860 | no | no | 13.000 | 5.000 | 16.000 | 4.000 | 240.000 | 79.000 | 2018 | 598 | 260.450 | 130.100 |
| 1721 | LG | Android | 2.860 | no | no | 13.000 | 5.000 | 16.000 | 4.000 | 240.000 | 46.000 | 2017 | 810 | 171.040 | 51.270 |
battery has a very high maximum value, but many rows share similarly high capacities, so those may be genuine. The lowest battery values, checked above, fall between 80 and 240 mAh, which looks implausibly small for working phones and may indicate data errors.
cell.sort_values(by='screen_size', ascending=False).head(20)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2562 | Samsung | Android | 46.360 | yes | no | 2.100 | 5.000 | 32.000 | 4.000 | 5700.000 | 65.000 | 2015 | 627 | 639.870 | 255.730 |
| 387 | Samsung | Android | 43.660 | yes | no | 12.000 | 5.000 | 64.000 | 3.000 | 12000.000 | 23.000 | 2019 | 266 | 659.470 | 460.580 |
| 1646 | Lenovo | Android | 33.500 | yes | no | 8.000 | 1.600 | 32.000 | 4.000 | 9600.000 | 950.000 | 2014 | 1085 | 409.190 | 122.830 |
| 204 | Xiaomi | Android | 32.390 | yes | yes | 12.000 | 20.000 | 512.000 | 12.000 | 4050.000 | 241.000 | 2019 | 478 | 2498.240 | 1248.990 |
| 666 | Apple | iOS | 31.910 | yes | no | 12.000 | 7.000 | 1024.000 | 4.000 | 9720.000 | 631.000 | 2018 | 543 | 1100.860 | 550.400 |
| 680 | Apple | iOS | 31.910 | yes | no | 12.000 | 7.000 | 64.000 | 4.000 | 2256.000 | 677.000 | 2017 | 541 | 900.630 | 450.350 |
| 401 | Apple | Others | 31.910 | yes | no | 12.000 | 7.000 | 128.000 | 6.000 | 9720.000 | 641.000 | 2020 | 220 | 999.580 | 749.840 |
| 697 | Apple | iOS | 31.910 | yes | no | 8.000 | 1.200 | 32.000 | 4.000 | 3937.000 | 713.000 | 2015 | 572 | 849.520 | 424.880 |
| 1194 | Others | Android | 31.430 | no | no | 8.000 | 2.000 | 16.000 | 4.000 | 5900.000 | 350.000 | 2015 | 976 | 389.790 | 117.070 |
| 1472 | Others | Others | 31.270 | no | no | 5.000 | 2.000 | 32.000 | 4.000 | 4450.000 | 384.000 | 2014 | 767 | 170.810 | 68.370 |
| 1198 | Others | Android | 31.270 | no | no | 2.000 | 0.300 | 16.000 | 4.000 | 3800.000 | 313.000 | 2014 | 664 | 120.570 | 48.310 |
| 2382 | Others | Android | 31.270 | no | no | 5.000 | 2.000 | 32.000 | 4.000 | 2500.000 | 154.000 | 2015 | 761 | 120.320 | 48.100 |
| 2362 | Others | Android | 31.270 | no | no | 2.000 | 0.300 | 32.000 | 4.000 | 4700.000 | 406.000 | 2013 | 1025 | 140.290 | 41.960 |
| 1192 | Others | Android | 30.960 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 2500.000 | 850.000 | 2015 | 652 | 490.180 | 196.070 |
| 2691 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 732.000 | 2014 | 799 | 469.180 | 187.650 |
| 2687 | Samsung | Android | 30.800 | yes | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 753.000 | 2014 | 1089 | 669.490 | 200.760 |
| 2688 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 753.000 | 2014 | 1078 | 641.130 | 192.390 |
| 2692 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 730.200 | 2014 | 1061 | 420.940 | 126.320 |
| 1609 | Lenovo | Android | 30.800 | yes | no | 13.000 | 5.000 | 16.000 | 4.000 | 4250.000 | 250.000 | 2015 | 718 | 179.680 | 71.820 |
| 2689 | Samsung | Android | 30.800 | no | no | 8.000 | 2.000 | 32.000 | 4.000 | 9500.000 | 750.000 | 2014 | 689 | 538.780 | 215.540 |
cell.sort_values(by='screen_size').head(10)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2449 | Samsung | Others | 2.700 | no | no | 12.000 | 9.000 | 16.000 | 4.000 | 230.000 | 25.000 | 2019 | 337 | 251.640 | 163.100 |
| 388 | Samsung | Others | 2.860 | no | no | 13.000 | 8.000 | 4.000 | 0.750 | 300.000 | 172.000 | 2017 | 853 | 198.570 | 59.480 |
| 2567 | Samsung | Others | 2.860 | no | no | 8.000 | 5.000 | 16.000 | 4.000 | 250.000 | 42.000 | 2015 | 981 | 270.570 | 81.270 |
| 1721 | LG | Android | 2.860 | no | no | 13.000 | 5.000 | 16.000 | 4.000 | 240.000 | 46.000 | 2017 | 810 | 171.040 | 51.270 |
| 1336 | Huawei | Android | 2.860 | yes | no | 13.000 | 8.000 | 16.000 | 4.000 | 420.000 | 57.000 | 2018 | 414 | 269.200 | 134.680 |
| 2569 | Samsung | Others | 2.860 | no | no | 8.000 | 5.000 | 16.000 | 4.000 | 300.000 | 51.000 | 2015 | 604 | 301.160 | 120.460 |
| 2553 | Samsung | Others | 2.860 | no | no | 12.000 | 5.000 | 16.000 | 4.000 | 300.000 | 54.000 | 2016 | 829 | 300.100 | 89.910 |
| 1700 | LG | Android | 2.860 | no | no | 13.000 | 5.000 | 16.000 | 4.000 | 240.000 | 79.000 | 2018 | 598 | 260.450 | 130.100 |
| 2568 | Samsung | Others | 2.860 | no | no | 8.000 | 5.000 | 16.000 | 4.000 | 250.000 | 47.000 | 2015 | 918 | 220.630 | 66.240 |
| 1357 | Huawei | Android | 2.860 | yes | no | 8.000 | 8.000 | 16.000 | 4.000 | 420.000 | 64.000 | 2017 | 633 | 330.270 | 132.150 |
After some web research on cell phones, I found that:
The smallest screen size on a feature phone was 1.7 inches, which is 4.32 cm.
The smallest screen size I found on a smartphone was 3 inches, which is 7.62 cm.
The biggest screen size I found was on an 8.01-inch foldable phone, which is 20.35 cm.
Phones with a recorded screen_size above 20.35 cm or below 4.32 cm may therefore not have the correct value.
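Those plausibility bounds can be turned into a simple flagging step. The sketch below uses toy values rather than `cell["screen_size"]`, and the 4.32/20.35 cm cutoffs are the research figures above, not established limits:

```python
import pandas as pd

# Plausible screen-size window in cm: 1.7 in = 4.32 cm, 8.01 in = 20.35 cm.
LOW_CM, HIGH_CM = 4.32, 20.35

# Toy values; in the notebook this would run on cell["screen_size"].
sizes = pd.Series([2.70, 13.49, 19.37, 31.91, 46.36], name="screen_size")

# Flag suspicious rows for review rather than dropping them outright.
suspect = sizes[(sizes < LOW_CM) | (sizes > HIGH_CM)]
print(suspect.tolist())  # [2.7, 31.91, 46.36]
```

Flagging first keeps the option open of treating the large values as tablets rather than errors.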
cell.columns
Index(['brand_name', 'os', 'screen_size', 'four_g', 'five_g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price'],
dtype='object')
Questions:
# Function to create barplots that indicate percentage for each category.
def bar_perc(plot, feature):
    '''
    plot: axes object returned by the plotting call
    feature: 1-d categorical feature array
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position for the annotation
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
def dist_box_violin(data):
    # plots a combined graph for univariate analysis of a continuous variable
    # to check spread, central tendency, dispersion and outliers
    Name = data.name.upper()
    fig, axes = plt.subplots(1, 3, figsize=(17, 7))
    fig.suptitle("SPREAD OF DATA FOR " + Name, fontsize=18, fontweight='bold')
    sns.distplot(data, kde=False, color='Blue', ax=axes[0])
    axes[0].axvline(data.mean(), color='y', linestyle='--', linewidth=2)
    axes[0].axvline(data.median(), color='r', linestyle='dashed', linewidth=2)
    axes[0].axvline(data.mode()[0], color='g', linestyle='solid', linewidth=2)
    axes[0].legend({'Mean': data.mean(), 'Median': data.median(), 'Mode': data.mode()})
    sns.boxplot(x=data, showmeans=True, orient='h', color="purple", ax=axes[1])
    # just exploring the violin plot
    sns.violinplot(x=data, ax=axes[2])
cell.used_price.describe(include='all')
count    3571.000
mean      109.880
std       121.501
min         2.510
25%        45.205
50%        75.530
75%       126.000
max      1916.540
Name: used_price, dtype: float64
dist_box_violin(cell.used_price)
Observations
cell.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3571 | 4 | Android | 3246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 14.804 | 5.153 | 2.700 | 12.700 | 13.490 | 16.510 | 46.360 |
| four_g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| five_g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3391.000 | NaN | NaN | NaN | 9.400 | 4.818 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3569.000 | NaN | NaN | NaN | 6.547 | 6.879 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3561.000 | NaN | NaN | NaN | 54.533 | 84.696 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3561.000 | NaN | NaN | NaN | 4.057 | 1.392 | 0.030 | 4.000 | 4.000 | 4.000 | 16.000 |
| battery | 3565.000 | NaN | NaN | NaN | 3067.226 | 1364.207 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3564.000 | NaN | NaN | NaN | 179.424 | 90.281 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
dist_box_violin(cell.new_price)
Observations
dist_box_violin(cell.screen_size)
list_col_n=['main_camera_mp','selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used']
dist_box_violin(cell.main_camera_mp)
dist_box_violin(cell.selfie_camera_mp)
dist_box_violin(cell.int_memory)
dist_box_violin(cell.ram)
dist_box_violin(cell.battery)
dist_box_violin(cell.weight)
dist_box_violin(cell.days_used)
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Display a labeled count plot for brand_name
labeled_barplot(cell, "brand_name")
brand = cell.groupby(['brand_name', 'os'])['used_price'].mean().reset_index()
brand['used_price'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
brand_b = cell.groupby('brand_name')['used_price'].mean().reset_index()
brand_b['used_price'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
cell.used_price.nunique()
3044
# Display a pie chart and count plot of release_year
colors_list = ['#3366cc','#651593','#a03b87','#e4a859','#da8266',"#FAAE7B",'#ffcc00','#ffff66']
f,ax=plt.subplots(1,2,figsize=(15,5))
cell['release_year'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],startangle=90,colors = colors_list,pctdistance=1.1, labeldistance=1.3)
ax[0].set_title('release_year')
ax[0].set_ylabel('')
sns.countplot(x='release_year', data=cell, ax=ax[1], palette=['#651593','#3366cc','#a03b87','#da8266','#ffcc00',"#FAAE7B",'#e4a859','#ffff66'])
ax[1].set_title('release_year')
plt.show()
#axis=sns.countplot(x='release_year', data=cell ,palette='plasma').set(title='release_year')
#bar_perc(axis,['release_year'])
# percentage of phones by release year
ax = sns.countplot(x='release_year', data=cell, palette='plasma')
plt.title('Percentage of cell phones by release year')
plt.xlabel('Release Year')
plt.ylabel('Count')
bar_perc(ax, cell['release_year'])
# Display a pie chart and count plot of os
colors_list = ['#651593','#a03b87','#da8266','#e4a859','#432371',"#FAAE7B"]
f,ax=plt.subplots(1,2,figsize=(15,5))
cell['os'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],startangle=90,colors = colors_list,pctdistance=1.1, labeldistance=1.3)
ax[0].set_title('OS')
ax[0].set_ylabel('')
sns.countplot(x='os', data=cell, ax=ax[1], palette=['#651593','#a03b87','#e4a859',"#FAAE7B",'#da8266'])
ax[1].set_title('os')
plt.show()
# Display a pie chart and count plot of four_g
colors_list = ['#a03b87','#da8266','#651593','#e4a859','#432371',"#FAAE7B"]
f,ax=plt.subplots(1,2,figsize=(15,5))
cell['four_g'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],startangle=90,colors = colors_list)
ax[0].set_title('four_g')
ax[0].set_ylabel('')
sns.countplot(x='four_g', data=cell, ax=ax[1], palette=['#a03b87','#da8266','#432371','#3399ff',"#FAAE7B"])
ax[1].set_title('four_g')
plt.show()
# Display a pie chart and count plot of five_g
colors_list = ['#a03b87','#da8266','#651593','#e4a859','#432371',"#FAAE7B"]
f,ax=plt.subplots(1,2,figsize=(15,5))
cell['five_g'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[0],startangle=90,colors = colors_list)
ax[0].set_title('five_g')
ax[0].set_ylabel('')
sns.countplot(x='five_g', data=cell, ax=ax[1], palette=['#a03b87','#da8266','#432371','#3399ff',"#FAAE7B"])
ax[1].set_title('five_g')
plt.show()
# Function to create barplots that indicate percentage for each category.
def bar_perc(plot, feature):
    '''
    plot: axes object returned by the plotting call
    feature: 1-d categorical feature array
    '''
    total = len(feature)  # length of the column
    for p in plot.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / total)  # percentage of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position for the annotation
        y = p.get_y() + p.get_height()  # height of the bar
        plot.annotate(percentage, (x, y), size=12)  # annotate the percentage
fig1, axes1 = plt.subplots(1, 3, figsize=(14, 5))
list_col = ['os', 'four_g', 'five_g']
for i in range(len(list_col)):
    order = cell[list_col[i]].value_counts(ascending=False).index  # display bars in descending order of count
    sns.countplot(x=list_col[i], data=cell, order=order, ax=axes1[i], palette='plasma').set(title=list_col[i].upper())
    bar_perc(axes1[i], cell[list_col[i]])
Questions:
temp=cell.copy()
#cell=temp.copy()
# count missing values per row -- most rows have none
num_missing = cell.isnull().sum(axis=1)
num_missing.value_counts()
0    3368
1     193
2       8
3       2
dtype: int64
# these rows are missing int_memory, ram, and battery
cell[num_missing == 3]
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2140 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | nan | nan | nan | 81.000 | 2016 | 815 | 29.210 | 8.650 |
| 2141 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | nan | nan | nan | 82.600 | 2016 | 1023 | 38.780 | 11.420 |
# these are missing int_memory and ram
cell[num_missing == 2]
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2142 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | nan | nan | 1200.000 | 91.800 | 2015 | 867 | 68.010 | 20.460 |
| 2143 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | nan | nan | 1200.000 | 91.800 | 2015 | 772 | 69.940 | 28.060 |
| 2144 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | nan | nan | 1100.000 | 79.000 | 2015 | 1012 | 41.920 | 12.520 |
| 2145 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | nan | nan | 1100.000 | 79.000 | 2015 | 798 | 40.480 | 16.130 |
| 2146 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | nan | nan | 800.000 | 69.600 | 2015 | 1009 | 19.920 | 6.070 |
| 2147 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | nan | nan | 800.000 | 69.800 | 2015 | 665 | 18.480 | 7.480 |
| 2148 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | nan | nan | 1100.000 | 78.400 | 2015 | 559 | 29.330 | 14.760 |
| 2149 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | nan | nan | 1100.000 | 78.600 | 2015 | 682 | 30.000 | 12.030 |
# these rows are missing main_camera_mp
cell[num_missing == 1].sample(15)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 287 | Infinix | Android | 16.190 | yes | no | nan | 8.000 | 32.000 | 2.000 | 5000.000 | 185.000 | 2020 | 329 | 88.880 | 57.950 |
| 418 | Coolpad | Android | 20.960 | yes | no | nan | 13.000 | 32.000 | 3.000 | 4000.000 | 170.000 | 2019 | 342 | 161.610 | 105.360 |
| 3483 | Motorola | Android | 16.030 | yes | no | nan | 8.000 | 64.000 | 4.000 | 5000.000 | 200.000 | 2020 | 277 | 122.400 | 84.730 |
| 2780 | Sony | Android | 16.030 | yes | no | nan | 8.000 | 64.000 | 4.000 | 3000.000 | 180.000 | 2019 | 252 | 429.100 | 295.970 |
| 356 | Oppo | Android | 15.880 | yes | yes | nan | 32.000 | 128.000 | 8.000 | 4025.000 | 180.000 | 2020 | 218 | 399.550 | 299.980 |
| 197 | Xiaomi | Android | 25.880 | yes | yes | nan | 16.000 | 128.000 | 6.000 | 4500.000 | 208.000 | 2020 | 184 | 260.810 | 196.450 |
| 2782 | Sony | Android | 13.810 | yes | no | nan | 8.000 | 32.000 | 4.000 | 3300.000 | 156.000 | 2019 | 195 | 198.150 | 149.100 |
| 200 | Xiaomi | Android | 18.730 | yes | no | nan | 8.000 | 32.000 | 2.000 | 5000.000 | 188.000 | 2020 | 321 | 80.820 | 52.130 |
| 325 | Meizu | Android | 16.030 | yes | no | nan | 16.000 | 128.000 | 6.000 | 4500.000 | 183.000 | 2019 | 348 | 251.190 | 164.230 |
| 164 | Vivo | Android | 24.290 | yes | yes | nan | 16.000 | 128.000 | 6.000 | 4500.000 | 198.100 | 2020 | 312 | 349.830 | 227.880 |
| 286 | Infinix | Android | 16.190 | yes | no | nan | 8.000 | 64.000 | 4.000 | 5000.000 | 185.000 | 2020 | 173 | 110.180 | 82.670 |
| 3383 | Realme | Android | 16.030 | yes | no | nan | 16.000 | 64.000 | 6.000 | 4000.000 | 199.000 | 2019 | 475 | 365.415 | 182.740 |
| 195 | Xiaomi | Android | 24.290 | yes | yes | nan | 16.000 | 64.000 | 6.000 | 4520.000 | 205.000 | 2020 | 301 | 200.600 | 128.760 |
| 3366 | Oppo | Android | 16.350 | yes | yes | nan | 32.000 | 256.000 | 12.000 | 4260.000 | 217.000 | 2020 | 193 | 934.991 | 699.640 |
| 101 | Meizu | Android | 16.030 | yes | no | nan | 8.000 | 32.000 | 2.000 | 4000.000 | 184.000 | 2019 | 190 | 180.400 | 137.750 |
cell.iloc[2138:2152]
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2138 | Nokia | Android | 13.490 | yes | no | 16.000 | 8.000 | 32.000 | 4.000 | 3000.000 | 169.000 | 2017 | 540 | 160.960 | 80.560 |
| 2139 | Nokia | Others | 5.720 | no | no | 2.000 | 8.000 | 32.000 | 4.000 | 3000.000 | 160.000 | 2017 | 554 | 60.260 | 30.190 |
| 2140 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | nan | nan | nan | 81.000 | 2016 | 815 | 29.210 | 8.650 |
| 2141 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | nan | nan | nan | 82.600 | 2016 | 1023 | 38.780 | 11.420 |
| 2142 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | nan | nan | 1200.000 | 91.800 | 2015 | 867 | 68.010 | 20.460 |
| 2143 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | nan | nan | 1200.000 | 91.800 | 2015 | 772 | 69.940 | 28.060 |
| 2144 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | nan | nan | 1100.000 | 79.000 | 2015 | 1012 | 41.920 | 12.520 |
| 2145 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | nan | nan | 1100.000 | 79.000 | 2015 | 798 | 40.480 | 16.130 |
| 2146 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | nan | nan | 800.000 | 69.600 | 2015 | 1009 | 19.920 | 6.070 |
| 2147 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | nan | nan | 800.000 | 69.800 | 2015 | 665 | 18.480 | 7.480 |
| 2148 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | nan | nan | 1100.000 | 78.400 | 2015 | 559 | 29.330 | 14.760 |
| 2149 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | nan | nan | 1100.000 | 78.600 | 2015 | 682 | 30.000 | 12.030 |
| 2150 | Nokia | Windows | 10.950 | yes | no | 5.000 | 1.600 | 32.000 | 4.000 | 1830.000 | 136.000 | 2014 | 809 | 99.780 | 29.860 |
| 2151 | Nokia | Android | 19.210 | no | no | 8.000 | 5.000 | 32.000 | 4.000 | 5300.000 | 318.000 | 2014 | 893 | 320.910 | 96.150 |
# Fill the missing int_memory and ram values in rows 2140-2149 with the values
# from row 2138 (the nearest complete Nokia entry), and the missing battery
# values in rows 2140-2141 with the value from row 2142.
cell.iloc[2140:2150,7]=cell.iloc[2138,7]  # int_memory -> 32 GB
cell.iloc[2140:2150,8]=cell.iloc[2138,8]  # ram -> 4 GB
cell.iloc[2140:2142,9]=cell.iloc[2142,9]  # battery -> 1200 mAh
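The same positional-slice fill pattern can be checked on a toy frame (the values here are illustrative, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Nokia rows above: a run of missing int_memory
# values overwritten with the value from a nearby complete row, using the
# same pattern as cell.iloc[2140:2150, 7] = cell.iloc[2138, 7].
toy = pd.DataFrame({"int_memory": [32.0, 32.0, np.nan, np.nan, np.nan]})
toy.iloc[2:5, 0] = toy.iloc[0, 0]
print(toy["int_memory"].tolist())  # [32.0, 32.0, 32.0, 32.0, 32.0]
```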
cell.iloc[2138:2152]
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2138 | Nokia | Android | 13.490 | yes | no | 16.000 | 8.000 | 32.000 | 4.000 | 3000.000 | 169.000 | 2017 | 540 | 160.960 | 80.560 |
| 2139 | Nokia | Others | 5.720 | no | no | 2.000 | 8.000 | 32.000 | 4.000 | 3000.000 | 160.000 | 2017 | 554 | 60.260 | 30.190 |
| 2140 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | 32.000 | 4.000 | 1200.000 | 81.000 | 2016 | 815 | 29.210 | 8.650 |
| 2141 | Nokia | Others | 5.720 | no | no | 0.300 | 0.300 | 32.000 | 4.000 | 1200.000 | 82.600 | 2016 | 1023 | 38.780 | 11.420 |
| 2142 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 1200.000 | 91.800 | 2015 | 867 | 68.010 | 20.460 |
| 2143 | Nokia | Others | 6.350 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 1200.000 | 91.800 | 2015 | 772 | 69.940 | 28.060 |
| 2144 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 1100.000 | 79.000 | 2015 | 1012 | 41.920 | 12.520 |
| 2145 | Nokia | Others | 5.720 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 1100.000 | 79.000 | 2015 | 798 | 40.480 | 16.130 |
| 2146 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 800.000 | 69.600 | 2015 | 1009 | 19.920 | 6.070 |
| 2147 | Nokia | Others | 3.180 | no | no | 2.000 | 2.000 | 32.000 | 4.000 | 800.000 | 69.800 | 2015 | 665 | 18.480 | 7.480 |
| 2148 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | 32.000 | 4.000 | 1100.000 | 78.400 | 2015 | 559 | 29.330 | 14.760 |
| 2149 | Nokia | Others | 5.720 | no | no | 0.300 | 2.000 | 32.000 | 4.000 | 1100.000 | 78.600 | 2015 | 682 | 30.000 | 12.030 |
| 2150 | Nokia | Windows | 10.950 | yes | no | 5.000 | 1.600 | 32.000 | 4.000 | 1830.000 | 136.000 | 2014 | 809 | 99.780 | 29.860 |
| 2151 | Nokia | Android | 19.210 | no | no | 8.000 | 5.000 | 32.000 | 4.000 | 5300.000 | 318.000 | 2014 | 893 | 320.910 | 96.150 |
cell.isnull().sum(axis=1).value_counts()
0 3378 1 193 dtype: int64
cell.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null category 1 os 3571 non-null category 2 screen_size 3571 non-null float64 3 four_g 3571 non-null category 4 five_g 3571 non-null category 5 main_camera_mp 3391 non-null float64 6 selfie_camera_mp 3569 non-null float64 7 int_memory 3571 non-null float64 8 ram 3571 non-null float64 9 battery 3567 non-null float64 10 weight 3564 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 dtypes: category(4), float64(9), int64(2) memory usage: 322.8 KB
list_col_n = ['main_camera_mp', 'selfie_camera_mp', 'battery', 'weight']
beg_null = cell.isnull().sum().sum()
print(beg_null)
for i in list_col_n:
    # select the column as a single-item list so the result stays a DataFrame
    cell[[i]] = cell[[i]].apply(lambda x: x.fillna(x.median()))
    end_null = cell.isnull().sum().sum()
    print(end_null)
    print("We have got rid of {} null values, filling them with the median.".format(beg_null - end_null))
193 13 We have got rid of 180 null values, filling them with the median. 11 We have got rid of 182 null values, filling them with the median. 7 We have got rid of 186 null values, filling them with the median. 0 We have got rid of 193 null values, filling them with the median.
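A possible refinement, sketched here on toy data: impute with per-brand medians instead of one global median, so the fill respects brand-level differences. `fill_with_group_median` is a hypothetical helper, not part of the notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical alternative to the global-median fill above: impute each
# column with its per-brand median, then fall back to the global median
# for any group with no observed values at all.
def fill_with_group_median(frame, column, group="brand_name"):
    out = frame.copy()
    out[column] = out.groupby(group)[column].transform(lambda s: s.fillna(s.median()))
    out[column] = out[column].fillna(out[column].median())  # global fallback
    return out

demo = pd.DataFrame({
    "brand_name": ["A", "A", "B", "B"],
    "battery": [3000.0, np.nan, 4000.0, 5000.0],
})
demo = fill_with_group_median(demo, "battery")
print(demo["battery"].tolist())  # [3000.0, 3000.0, 4000.0, 5000.0]
```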
cell.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null category 1 os 3571 non-null category 2 screen_size 3571 non-null float64 3 four_g 3571 non-null category 4 five_g 3571 non-null category 5 main_camera_mp 3571 non-null float64 6 selfie_camera_mp 3571 non-null float64 7 int_memory 3571 non-null float64 8 ram 3571 non-null float64 9 battery 3571 non-null float64 10 weight 3571 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 dtypes: category(4), float64(9), int64(2) memory usage: 322.8 KB
# pairplot of numeric variables
sns.pairplot(cell, diag_kind='kde') # to plot density curve instead of histogram on the diag
<seaborn.axisgrid.PairGrid at 0x164c7998c10>
df_ram = cell.groupby("brand_name")["ram"].mean()
df_ram  # note: df_ram.mean (without parentheses) would display the bound method, not the values
brand_name Acer 3.902 Alcatel 3.426 Apple 4.000 Asus 4.048 BlackBerry 3.830 Celkon 1.466 Coolpad 3.955 Gionee 3.933 Google 4.533 HTC 4.000 Honor 4.593 Huawei 4.641 Infinix 2.600 Karbonn 3.375 LG 3.894 Lava 3.278 Lenovo 3.887 Meizu 4.452 Micromax 3.750 Microsoft 4.000 Motorola 3.945 Nokia 3.634 OnePlus 6.364 Oppo 4.961 Others 3.751 Panasonic 4.000 Realme 4.195 Samsung 4.159 Sony 4.068 Spice 3.750 Vivo 4.756 XOLO 4.000 Xiaomi 4.567 ZTE 4.023 Name: ram, dtype: float64
plt.figure(figsize=(15, 5))
sns.barplot(x='brand_name', y='ram', data=cell, palette='plasma');
plt.xticks(rotation=90, fontsize=15);
df_bw = cell[cell.battery > 4500]
df_bw.groupby("battery")["weight"].mean()
# line plot of mean weight vs battery capacity (phones with battery > 4500 mAh)
sns.lineplot(x=df_bw['battery'], y=df_bw["weight"]);
plt.title('weight VS battery')
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
df_bw.mean()
screen_size 21.487 main_camera_mp 9.001 selfie_camera_mp 7.206 int_memory 59.145 ram 4.246 battery 5884.103 weight 330.719 release_year 2017.113 days_used 568.760 new_price 307.462 used_price 155.901 dtype: float64
# helper to convert screen size from cm to inches
def cm_to_inch(data):
    return round(data / 2.54, 2)
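A quick sanity check on the conversion (the helper is restated so the snippet runs on its own): 2.54 cm is exactly 1 inch, and the dataset's most common 12.70 cm screen comes out to 5 inches.

```python
# Restated helper, mirroring the notebook's cm_to_inch above.
def cm_to_inch(data):
    return round(data / 2.54, 2)

print(cm_to_inch(2.54))   # 1.0
print(cm_to_inch(12.7))   # 5.0
```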
temp = cell.copy()  # keep a copy with screen_size still in cm
cell['screen_size'] = cm_to_inch(cell['screen_size'])
cell.sample(20)
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2514 | Samsung | Android | 5.440 | yes | no | 16.000 | 16.000 | 64.000 | 4.000 | 3300.000 | 172.000 | 2017 | 623 | 300.370 | 120.100 |
| 1342 | Huawei | Android | 6.060 | yes | no | 13.000 | 24.000 | 64.000 | 4.000 | 4000.000 | 180.000 | 2018 | 700 | 617.500 | 246.970 |
| 764 | Asus | Android | 5.310 | yes | no | 13.000 | 13.000 | 32.000 | 4.000 | 5000.000 | 170.000 | 2017 | 582 | 320.100 | 160.110 |
| 2530 | Samsung | Android | 5.440 | yes | no | 12.000 | 5.000 | 64.000 | 4.000 | 3500.000 | 169.000 | 2016 | 852 | 760.910 | 228.120 |
| 2173 | Nokia | Windows | 6.000 | yes | no | 20.000 | 1.200 | 16.000 | 4.000 | 3400.000 | 209.000 | 2013 | 1052 | 351.080 | 105.400 |
| 1346 | Huawei | Android | 11.190 | yes | no | 13.000 | 8.000 | 32.000 | 4.000 | 3000.000 | 155.000 | 2018 | 617 | 168.850 | 67.460 |
| 298 | Lenovo | Android | 8.440 | yes | no | 8.000 | 16.000 | 64.000 | 6.000 | 4000.000 | 159.000 | 2019 | 426 | 250.590 | 125.200 |
| 2367 | Others | Android | 7.000 | no | no | 2.000 | 0.300 | 32.000 | 4.000 | 4000.000 | 305.000 | 2013 | 657 | 131.140 | 52.510 |
| 594 | Others | Android | 6.130 | yes | no | 12.000 | 8.000 | 32.000 | 4.000 | 4000.000 | 168.000 | 2018 | 396 | 219.530 | 142.730 |
| 167 | Vivo | Android | 8.750 | yes | yes | 13.000 | 32.000 | 128.000 | 8.000 | 4350.000 | 196.500 | 2019 | 279 | 422.190 | 296.810 |
| 3196 | ZTE | Android | 5.310 | yes | no | 13.000 | 16.000 | 64.000 | 4.000 | 3000.000 | 164.000 | 2017 | 879 | 118.940 | 35.740 |
| 1145 | Honor | Android | 8.440 | yes | no | 13.000 | 14.500 | 128.000 | 4.000 | 3500.000 | 206.000 | 2018 | 443 | 549.640 | 274.880 |
| 200 | Xiaomi | Android | 7.370 | yes | no | 8.000 | 8.000 | 32.000 | 2.000 | 5000.000 | 188.000 | 2020 | 321 | 80.820 | 52.130 |
| 2445 | Samsung | Android | 6.250 | yes | no | 12.000 | 9.000 | 1024.000 | 4.000 | 4100.000 | 175.000 | 2019 | 291 | 1050.760 | 726.730 |
| 1489 | Karbonn | Android | 3.130 | no | no | 3.150 | 0.300 | 512.000 | 0.250 | 1625.000 | 114.500 | 2013 | 760 | 48.620 | 19.150 |
| 591 | Alcatel | Android | 5.000 | yes | no | 8.000 | 1.300 | 16.000 | 4.000 | 2500.000 | 135.000 | 2013 | 1042 | 351.450 | 105.400 |
| 119 | Nokia | Others | 2.250 | yes | no | 0.300 | 8.000 | 0.020 | 16.000 | 1200.000 | 86.500 | 2019 | 234 | 39.980 | 30.800 |
| 2115 | Nokia | Android | 7.130 | yes | no | 8.000 | 20.000 | 64.000 | 4.000 | 3500.000 | 180.000 | 2018 | 485 | 399.100 | 199.610 |
| 1381 | Huawei | Android | 5.310 | yes | no | 12.500 | 8.000 | 256.000 | 4.000 | 4000.000 | 169.000 | 2016 | 848 | 921.520 | 276.380 |
| 39 | Huawei | Android | 9.690 | yes | no | 13.000 | 16.000 | 128.000 | 6.000 | 4000.000 | 206.000 | 2019 | 460 | 290.740 | 145.440 |
temp.describe()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 | 3571.000 |
| mean | 14.804 | 9.330 | 6.546 | 54.470 | 4.057 | 3066.105 | 179.384 | 2015.965 | 675.391 | 237.389 | 109.880 |
| std | 5.153 | 4.705 | 6.878 | 84.586 | 1.390 | 1363.777 | 90.197 | 2.292 | 248.641 | 197.546 | 121.501 |
| min | 2.700 | 0.080 | 0.300 | 0.005 | 0.030 | 80.000 | 23.000 | 2013.000 | 91.000 | 9.130 | 2.510 |
| 25% | 12.700 | 5.000 | 2.000 | 16.000 | 4.000 | 2100.000 | 140.000 | 2014.000 | 536.000 | 120.130 | 45.205 |
| 50% | 13.490 | 8.000 | 5.000 | 32.000 | 4.000 | 3000.000 | 159.000 | 2016.000 | 690.000 | 189.800 | 75.530 |
| 75% | 16.510 | 13.000 | 8.000 | 64.000 | 4.000 | 4000.000 | 184.000 | 2018.000 | 872.000 | 291.935 | 126.000 |
| max | 46.360 | 48.000 | 32.000 | 1024.000 | 16.000 | 12000.000 | 950.000 | 2020.000 | 1094.000 | 2560.200 | 1916.540 |
temp1=cell[cell['screen_size']>6]
print(temp.screen_size.value_counts().sort_values(ascending=False))
print(temp.screen_size.unique() )
print(temp.screen_size.nunique())
12.700 576
13.490 409
10.950 219
10.160 189
15.240 129
...
28.730 1
14.600 1
11.910 1
10.320 1
31.430 1
Name: screen_size, Length: 127, dtype: int64
[23.97 28.1 24.29 26.04 15.72 21.43 19.84 18.57 16.67 8.73 19.37 24.61
25.56 20.32 15.56 13.81 15.24 16.03 24.45 26.67 15.88 23.65 25.08 20.96
10.16 17.78 28.26 16.19 12.7 16.51 6.35 15.4 18.73 18.26 13.49 3.02
7.62 8.41 22.7 16.35 23.5 5.72 14.76 25.88 22.07 10.95 14.13 22.22
24.13 21.27 29.37 20.8 21.75 23.34 28.42 14.92 32.39 23.02 21.91 26.19
12.86 3.18 43.66 2.86 31.91 27.94 25.72 19.21 3.81 11.27 10.64 13.18
13.02 6.03 23.81 20.48 20.64 11.43 20.16 13.97 3.33 28.73 12.54 9.68
13.34 7.78 10.48 5.08 5.4 25.24 19.69 30.32 27.46 15.08 30.96 31.43
31.27 25.4 17.62 18.1 9.05 7.94 30.8 33.5 8.57 24.92 8.89 27.78
20. 18.42 5.56 14.6 26.35 18.89 8.1 2.7 13.65 46.36 9.84 22.38
10.8 11.91 11.11 3.49 19.53 10.32 17.94]
127
plt.figure(figsize=(15, 5))
sns.barplot(x='brand_name', y='screen_size', data=temp1, palette='plasma');
plt.xticks(rotation=90, fontsize=15);
# line plot of mean screen_size by brand, for phones with screens larger than 6 inches
plt.figure(figsize=(15, 5))
sns.lineplot(x=temp1['brand_name'], y=temp1["screen_size"]);
plt.title('screen_size > 6 inches by brand_name')
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
# function to plot the distribution of a series with its mean marked
def plot_distribution(inp):
    plt.figure()
    ax = sns.distplot(inp)
    plt.axvline(np.mean(inp), color='k', linestyle='dashed', linewidth=5)
    _, max_ = plt.ylim()
    plt.text(
        inp.mean() + inp.mean() / 10,
        max_ - max_ / 10,
        "Mean: {:.2f}".format(inp.mean()),
    )
    return ax
selfie=cell[cell['selfie_camera_mp']>8]
selfie.groupby('brand_name')['selfie_camera_mp'].count()
selfie.sort_values(by='selfie_camera_mp')
selfie.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 666.000 | 7.218 | 1.808 | 1.060 | 6.130 | 6.340 | 8.675 | 12.750 |
| main_camera_mp | 666.000 | 11.386 | 4.297 | 0.300 | 8.000 | 12.000 | 13.000 | 48.000 |
| selfie_camera_mp | 666.000 | 18.680 | 6.312 | 9.000 | 16.000 | 16.000 | 20.000 | 32.000 |
| int_memory | 666.000 | 108.240 | 83.033 | 4.000 | 64.000 | 128.000 | 128.000 | 1024.000 |
| ram | 666.000 | 5.266 | 2.089 | 0.030 | 4.000 | 4.000 | 6.000 | 16.000 |
| battery | 666.000 | 3828.559 | 776.013 | 230.000 | 3400.000 | 4000.000 | 4200.000 | 6000.000 |
| weight | 666.000 | 180.390 | 28.711 | 25.000 | 165.550 | 181.000 | 196.375 | 300.000 |
| release_year | 666.000 | 2018.631 | 1.234 | 2013.000 | 2018.000 | 2019.000 | 2019.000 | 2020.000 |
| days_used | 666.000 | 424.200 | 206.416 | 91.000 | 264.000 | 385.500 | 545.250 | 1091.000 |
| new_price | 666.000 | 387.318 | 291.621 | 99.700 | 218.923 | 299.340 | 459.870 | 2560.200 |
| used_price | 666.000 | 228.294 | 195.126 | 35.740 | 118.400 | 165.020 | 268.502 | 1916.540 |
plt.figure(figsize=(15, 5))
sns.barplot(x='brand_name', y='selfie_camera_mp', data=selfie, palette='plasma');
plt.xticks(rotation=90, fontsize=15);
# line plot of mean selfie_camera_mp by brand (front cameras above 8 MP)
plt.figure(figsize=(15, 5))
sns.lineplot(x=selfie['brand_name'], y=selfie["selfie_camera_mp"]);
plt.title('selfie_camera_mp > 8 MP by brand_name')
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
# plot distribution of selfie_camera_mp for phones with front cameras above 8 MP
plot_distribution(selfie["selfie_camera_mp"])
# heatmap of correlations between the numeric variables
numeric_columns = cell.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15,7))
sns.heatmap(cell[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='PuRd')
<AxesSubplot:>
budget = cell.copy()
brand_group = budget.groupby('brand_name')
brand_group.mean()
# Binning used_price into three bands: low (<= 250), medium (250-500), high (> 500)
budget.loc[(budget['used_price'] <= 250), 'u_budget_price'] = "low"      # up to 250 euros
budget.loc[(budget['used_price'] > 250) & (budget['used_price'] <= 500), 'u_budget_price'] = "medium"  # 250-500 euros
budget.loc[(budget['used_price'] > 500), 'u_budget_price'] = 'high'      # above 500 euros
# Grouping and counting
budget.groupby('u_budget_price')['brand_name'].count().sort_values(ascending=False)
u_budget_price low 3289 medium 215 high 67 Name: brand_name, dtype: int64
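The three `.loc` assignments can equivalently be written as a single `pd.cut` call; a minimal sketch on illustrative prices, using the same edges and labels:

```python
import numpy as np
import pandas as pd

# pd.cut with right-closed bins reproduces the notebook's scheme:
# <= 250 -> low, 250-500 -> medium, > 500 -> high.
prices = pd.Series([120.0, 260.0, 510.0, 75.5])
bands = pd.cut(prices, bins=[-np.inf, 250, 500, np.inf],
               labels=["low", "medium", "high"])
print(bands.tolist())  # ['low', 'medium', 'high', 'low']
```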
# distribution of used phones across the three price bands
ax = sns.countplot(x=budget['u_budget_price'], palette='plasma')
plt.title('Share of used phones by price band: low (<= 250), medium (250-500), high (> 500)')
plt.xlabel('u_budget_price')
plt.ylabel('Count')
bar_perc(ax, budget['u_budget_price'])
df_hm = cell.pivot_table(
index="brand_name", columns="os", values="used_price", aggfunc=np.median
)
# Draw a heatmap
f, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(df_hm, cmap="Spectral", linewidths=0.5, annot=True, ax=ax)
plt.show()
df_hm = cell.pivot_table(
index="release_year", columns="os", values="used_price", aggfunc=np.median
)
# Draw a heatmap
f, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(df_hm, cmap="Spectral", linewidths=0.5, annot=True, ax=ax)
plt.show()
# scatter plots of used_price against each of the other variables, coloured by os
list_col=['main_camera_mp','selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
          'days_used','new_price','four_g','five_g', 'release_year']
for i in range(len(list_col)):
    plt.figure(figsize=(15, 5))
    axis = sns.scatterplot(data=cell, y="used_price", x=list_col[i], hue='os')
    plt.title("used_price vs " + str(list_col[i]));
    plt.show()
# average release_year by brand and os
plt.figure(figsize=(15, 7))
sns.lineplot(x="brand_name", y="release_year", hue="os", data=cell, ci=None)
plt.title("release_year VS brand_name VS os")
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
plt.figure(figsize=(15, 7))
sns.lineplot(x="release_year", y="days_used",hue="os", data=cell, ci=None)
plt.title("release_year VS days_used VS os")
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
plt.figure(figsize=(15, 7))
sns.lineplot(x="selfie_camera_mp", y="release_year",hue="os", data=cell, ci=None)
plt.title("release_year VS selfie_camera_mp VS os")
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
plt.figure(figsize=(15, 7))
sns.scatterplot(x="battery", y="days_used", hue="os", data=cell)  # ci is not a scatterplot argument
plt.title("battery VS days_used VS os")
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
plt.figure(figsize=(15, 5))
sns.scatterplot(y="used_price", x="brand_name", hue="os", data=cell)
plt.title("used_price VS brand_name VS os")
plt.xticks(rotation=90); # rotate the x-axis labels
plt.show()
Let's treat outliers in the data by flooring and capping.
df=cell.copy()
Extreme points are not necessarily bad data: every point in these plots comes from the same underlying distribution, so calling a point an outlier does not mean it is an error. Depending on the distribution, almost all z-scores may fall between -3 and 3, or there may legitimately be some extremely large values.
# Frequency-encode brand_name: with 34 unique brands, one-hot encoding would
# create too many columns, so map each brand to its share of rows instead.
bn = df.groupby('brand_name').size() / len(df)
df.loc[:, 'brand_name_encode'] = df['brand_name'].map(bn)
df
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | brand_name_encode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 9.440 | yes | no | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | 86.960 | 0.033 |
| 1 | Honor | Android | 11.060 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 161.490 | 0.033 |
| 2 | Honor | Android | 9.560 | yes | yes | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 268.550 | 0.033 |
| 3 | Honor | Android | 10.250 | yes | yes | 13.000 | 8.000 | 64.000 | 6.000 | 7250.000 | 480.000 | 2020 | 345 | 278.930 | 180.230 | 0.033 |
| 4 | Honor | Android | 6.190 | yes | no | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | 103.800 | 0.033 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3566 | Asus | Android | 6.250 | yes | no | 8.000 | 8.000 | 64.000 | 6.000 | 5000.000 | 190.000 | 2019 | 232 | 654.500 | 490.960 | 0.035 |
| 3567 | Asus | Android | 6.000 | yes | no | 13.000 | 8.000 | 128.000 | 8.000 | 4000.000 | 200.000 | 2018 | 541 | 518.814 | 259.300 | 0.035 |
| 3568 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 3.000 | 4000.000 | 165.000 | 2020 | 201 | 92.650 | 69.810 | 0.035 |
| 3569 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 2.000 | 4000.000 | 160.000 | 2020 | 149 | 101.915 | 76.070 | 0.035 |
| 3570 | Alcatel | Android | 5.310 | yes | no | 13.000 | 5.000 | 16.000 | 2.000 | 4000.000 | 168.000 | 2020 | 176 | 72.241 | 50.480 | 0.035 |
3571 rows × 16 columns
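A minimal illustration of this frequency encoding on toy data, showing that each brand maps to its share of the rows:

```python
import pandas as pd

# Each brand is replaced by its relative frequency in the frame, exactly
# as df['brand_name'].map(bn) does above.
toy = pd.DataFrame({"brand_name": ["Nokia", "Nokia", "Sony", "Oppo"]})
freq = toy.groupby("brand_name").size() / len(toy)
toy["brand_name_encode"] = toy["brand_name"].map(freq)
print(toy["brand_name_encode"].tolist())  # [0.5, 0.5, 0.25, 0.25]
```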
plt.hist(df['brand_name_encode'], 20)
plt.title('Histogram of brand_name_encode')
plt.show()
# heatmap of correlations between the numeric variables, now including brand_name_encode
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15,7))
sns.heatmap(df[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt='.2f', cmap='PuRd')
<AxesSubplot:>
plt.hist(df['ram'], 20)
plt.title('Histogram of ram')
plt.show()
sns.boxplot(df['ram'])
plt.title('Boxplot of ram')
plt.show()
quartiles = np.quantile(df['ram'][df['ram'].notnull()], [.25, .75])
ram_4iqr = 4 * (quartiles[1] - quartiles[0])
print(f'Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {ram_4iqr}')
outlier_ram = df.loc[np.abs(df['ram'] - df['ram'].median()) > ram_4iqr, 'ram']
outlier_ram.value_counts().sum()
Q1 = 4.0, Q3 = 4.0, 4*IQR = 0.0
643
Most ram values are 4, so the IQR is 0 and an IQR-based rule would flag 643 rows. I don't want to lose that variance, so instead I filter the data frame to keep only the rows whose 'ram' value lies within 3 standard deviations of the mean.
# displaying the data frame before filtering
df
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | brand_name_encode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 9.440 | yes | no | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | 86.960 | 0.033 |
| 1 | Honor | Android | 11.060 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 161.490 | 0.033 |
| 2 | Honor | Android | 9.560 | yes | yes | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 268.550 | 0.033 |
| 3 | Honor | Android | 10.250 | yes | yes | 13.000 | 8.000 | 64.000 | 6.000 | 7250.000 | 480.000 | 2020 | 345 | 278.930 | 180.230 | 0.033 |
| 4 | Honor | Android | 6.190 | yes | no | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | 103.800 | 0.033 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3566 | Asus | Android | 6.250 | yes | no | 8.000 | 8.000 | 64.000 | 6.000 | 5000.000 | 190.000 | 2019 | 232 | 654.500 | 490.960 | 0.035 |
| 3567 | Asus | Android | 6.000 | yes | no | 13.000 | 8.000 | 128.000 | 8.000 | 4000.000 | 200.000 | 2018 | 541 | 518.814 | 259.300 | 0.035 |
| 3568 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 3.000 | 4000.000 | 165.000 | 2020 | 201 | 92.650 | 69.810 | 0.035 |
| 3569 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 2.000 | 4000.000 | 160.000 | 2020 | 149 | 101.915 | 76.070 | 0.035 |
| 3570 | Alcatel | Android | 5.310 | yes | no | 13.000 | 5.000 | 16.000 | 2.000 | 4000.000 | 168.000 | 2020 | 176 | 72.241 | 50.480 | 0.035 |
3571 rows × 16 columns
df.columns
Index(['brand_name', 'os', 'screen_size', 'four_g', 'five_g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price',
'brand_name_encode'],
dtype='object')
df5 = df.copy()  # checkpoint copy before outlier filtering
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3571 entries, 0 to 3570 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3571 non-null category 1 os 3571 non-null category 2 screen_size 3571 non-null float64 3 four_g 3571 non-null category 4 five_g 3571 non-null category 5 main_camera_mp 3571 non-null float64 6 selfie_camera_mp 3571 non-null float64 7 int_memory 3571 non-null float64 8 ram 3571 non-null float64 9 battery 3571 non-null float64 10 weight 3571 non-null float64 11 release_year 3571 non-null int64 12 days_used 3571 non-null int64 13 new_price 3571 non-null float64 14 used_price 3571 non-null float64 15 brand_name_encode 3571 non-null float64 dtypes: category(4), float64(10), int64(2) memory usage: 350.7 KB
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3571 | 4 | Android | 3246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 5.828 | 2.029 | 1.060 | 5.000 | 5.310 | 6.500 | 18.250 |
| four_g | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| five_g | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3571.000 | NaN | NaN | NaN | 9.330 | 4.705 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3571.000 | NaN | NaN | NaN | 6.546 | 6.878 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3571.000 | NaN | NaN | NaN | 54.470 | 84.586 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3571.000 | NaN | NaN | NaN | 4.057 | 1.390 | 0.030 | 4.000 | 4.000 | 4.000 | 16.000 |
| battery | 3571.000 | NaN | NaN | NaN | 3066.105 | 1363.777 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3571.000 | NaN | NaN | NaN | 179.384 | 90.197 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
| brand_name_encode | 3571.000 | NaN | NaN | NaN | 0.058 | 0.043 | 0.003 | 0.031 | 0.038 | 0.074 | 0.143 |
# keep only the rows whose 'ram' value is within 3 standard deviations of the mean
df = df[np.abs(df.ram - df.ram.mean()) <= (3 * df.ram.std())]
# calculating z scores of all variables
#from scipy import stats
#cols_z_score = ['screen_size','main_camera_mp','selfie_camera_mp','new_price',
# 'battery','weight','int_memory','days_used', 'ram']
#df=df[(np.abs(stats.zscore(df[cols_z_score])) < 3).all(axis=1)]
df.ram.describe()
count 3549.000 mean 4.003 std 1.208 min 0.030 25% 4.000 50% 4.000 75% 4.000 max 8.000 Name: ram, dtype: float64
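The effect of the 3-sigma rule can be seen on a toy series where one extreme value sits among many identical ones (illustrative numbers only):

```python
import numpy as np
import pandas as pd

# One 16 GB value among twenty 4 GB values is more than three standard
# deviations from the mean, so the same filter used above drops it.
s = pd.Series([4.0] * 20 + [16.0])
kept = s[np.abs(s - s.mean()) <= 3 * s.std()]
print(len(kept))  # 20
```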
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3549.000 | 5.823 | 2.024 | 1.060 | 5.000 | 5.310 | 6.440 | 18.250 |
| main_camera_mp | 3549.000 | 9.334 | 4.713 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3549.000 | 6.477 | 6.812 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3549.000 | 53.362 | 82.694 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3549.000 | 4.003 | 1.208 | 0.030 | 4.000 | 4.000 | 4.000 | 8.000 |
| battery | 3549.000 | 3062.109 | 1362.480 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3549.000 | 179.395 | 90.358 | 23.000 | 140.000 | 159.000 | 183.000 | 950.000 |
| release_year | 3549.000 | 2015.943 | 2.281 | 2013.000 | 2014.000 | 2015.000 | 2018.000 | 2020.000 |
| days_used | 3549.000 | 677.866 | 247.218 | 91.000 | 541.000 | 692.000 | 873.000 | 1094.000 |
| new_price | 3549.000 | 233.293 | 185.461 | 9.130 | 120.090 | 189.650 | 290.530 | 2560.200 |
| used_price | 3549.000 | 106.953 | 112.884 | 2.510 | 45.140 | 75.420 | 125.250 | 1916.540 |
| brand_name_encode | 3549.000 | 0.058 | 0.043 | 0.003 | 0.031 | 0.038 | 0.074 | 0.143 |
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3549 entries, 0 to 3570 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3549 non-null category 1 os 3549 non-null category 2 screen_size 3549 non-null float64 3 four_g 3549 non-null category 4 five_g 3549 non-null category 5 main_camera_mp 3549 non-null float64 6 selfie_camera_mp 3549 non-null float64 7 int_memory 3549 non-null float64 8 ram 3549 non-null float64 9 battery 3549 non-null float64 10 weight 3549 non-null float64 11 release_year 3549 non-null int64 12 days_used 3549 non-null int64 13 new_price 3549 non-null float64 14 used_price 3549 non-null float64 15 brand_name_encode 3549 non-null float64 dtypes: category(4), float64(10), int64(2) memory usage: 376.2 KB
# histogram of ram after removing the outliers
df['ram'].hist(bins=20)
plt.title('ram after dropping values beyond 3 standard deviations from the mean')
plt.show()
print(df['ram'].mean())
print(df['ram'].median())
4.00305438151592 4.0
# fraction of values lying more than 1.5*IQR away from the median
def frac_outside_1pt5_IQR(x):
    length = 1.5 * np.diff(np.quantile(x, [.25, .75]))
    return np.mean(np.abs(x - np.median(x)) > length)
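Restating the helper so an example runs standalone, on a small illustrative array where exactly one of the five points lies beyond the 1.5*IQR fence:

```python
import numpy as np

# Fraction of points more than 1.5*IQR away from the median.
def frac_outside_1pt5_IQR(x):
    length = 1.5 * np.diff(np.quantile(x, [.25, .75]))
    return np.mean(np.abs(x - np.median(x)) > length)

# Q1=2, Q3=4, so the fence is 3 units from the median (3); only 100 is outside.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(frac_outside_1pt5_IQR(x))  # 0.2
```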
plt.hist(df['screen_size'], 20)
plt.title('Histogram of screen_size')
plt.show()
sns.boxplot(df['screen_size'])
plt.title('Boxplot of screen_size')
plt.show()
screen_size has many values flagged as suspicious by the boxplot, and the histogram shows the distribution is skewed, with the largest values not consistent with the bulk of the data. A heavy tail like this means we may prefer statistics that are less sensitive to large values, e.g. the median as the measure of central tendency.
# flag screen_size values lying more than 4*IQR away from the median
quartiles = np.quantile(df['screen_size'][df['screen_size'].notnull()], [.25, .75])
screen_4iqr = 4 * (quartiles[1] - quartiles[0])
print(f'Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {screen_4iqr}')
outlier_screen = df.loc[np.abs(df['screen_size'] - df['screen_size'].median()) > screen_4iqr, 'screen_size']
outlier_screen
Q1 = 5.0, Q3 = 6.44, 4*IQR = 5.760000000000002
60 11.130 165 11.560 175 11.560 176 11.560 191 11.190 285 11.130 387 17.190 401 12.560 666 12.560 680 12.560 697 12.560 712 11.310 742 11.190 862 11.190 863 11.190 1077 11.190 1151 11.940 1157 11.190 1160 11.190 1192 12.190 1194 12.370 1196 11.940 1198 12.310 1345 11.190 1346 11.190 1347 11.190 1472 12.310 1549 11.940 1555 11.190 1609 12.130 1646 13.190 1873 11.190 2110 11.190 2362 12.310 2382 12.310 2562 18.250 2687 12.130 2688 12.130 2689 12.130 2690 12.130 2691 12.130 2692 12.130 2760 11.190 2958 11.190 3044 11.190 3054 11.190 3055 11.190 3056 11.190 3057 11.190 3058 11.190 3059 11.190 3065 11.190 3178 11.190 3439 11.190 Name: screen_size, dtype: float64
# comparing the distribution of screen_size before and after handling the outliers
df['screen_size'].hist(bins=20)
plt.title('screen_size before removing the outliers')
plt.show()
print(df['screen_size'].mean())
# setting the flagged outliers to NaN (the corresponding rows are dropped below)
df.loc[np.abs(df['screen_size'] - df['screen_size'].median()) > screen_4iqr, 'screen_size'] = np.nan
df['screen_size'].hist(bins=20)
plt.title('screen_size after removing outliers')
plt.show()
5.823020569174413
# dropping these rows
df.drop(outlier_screen.index, axis=0, inplace=True)
screen = df['screen_size'][df['screen_size'].notnull()]
print(screen.mean()) # the mean and median look close
print(screen.median())
5.729470672389133
5.31
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3495 entries, 0 to 3570
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   brand_name         3495 non-null   category
 1   os                 3495 non-null   category
 2   screen_size        3495 non-null   float64
 3   four_g             3495 non-null   category
 4   five_g             3495 non-null   category
 5   main_camera_mp     3495 non-null   float64
 6   selfie_camera_mp   3495 non-null   float64
 7   int_memory         3495 non-null   float64
 8   ram                3495 non-null   float64
 9   battery            3495 non-null   float64
 10  weight             3495 non-null   float64
 11  release_year       3495 non-null   int64
 12  days_used          3495 non-null   int64
 13  new_price          3495 non-null   float64
 14  used_price         3495 non-null   float64
 15  brand_name_encode  3495 non-null   float64
dtypes: category(4), float64(10), int64(2)
memory usage: 370.5 KB
weight has many values flagged as suspicious by the boxplot, but the histogram shows that the distribution is skewed. Having a heavy tail means we may want to use statistics less sensitive to large values; e.g., the median may be a better measure of central tendency.
quartiles = np.quantile(df['weight'][df['weight'].notnull()], [.25, .75])
weight_3iqr = 3 * (quartiles[1] - quartiles[0])
print(f'Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 3*IQR = {weight_3iqr}')
outlier_weight = df.loc[np.abs(df['weight'] - df['weight'].median()) > weight_3iqr, 'weight']
outlier_weight
Q1 = 140.0, Q3 = 183.0, 3*IQR = 129.0
3 480.000
22 453.600
32 310.000
34 450.000
37 460.000
...
3162 469.000
3320 300.000
3392 290.000
3393 290.000
3463 300.000
Name: weight, Length: 307, dtype: float64
# dropping these rows
df.drop(outlier_weight.index, axis=0, inplace=True)
df.shape
(3188, 16)
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3188.000 | 5.469 | 1.701 | 1.130 | 4.440 | 5.160 | 6.130 | 11.060 |
| main_camera_mp | 3188.000 | 9.642 | 4.707 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3188.000 | 6.793 | 6.985 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3188.000 | 54.199 | 81.582 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3188.000 | 4.014 | 1.231 | 0.030 | 4.000 | 4.000 | 4.000 | 8.000 |
| battery | 3188.000 | 2838.205 | 1066.461 | 80.000 | 2000.000 | 2900.000 | 3500.000 | 8220.000 |
| weight | 3188.000 | 155.488 | 34.230 | 30.000 | 138.900 | 155.000 | 174.000 | 285.000 |
| release_year | 3188.000 | 2015.989 | 2.273 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3188.000 | 675.866 | 248.786 | 91.000 | 535.000 | 691.000 | 872.000 | 1094.000 |
| new_price | 3188.000 | 227.483 | 174.275 | 9.130 | 118.998 | 181.100 | 280.955 | 1751.180 |
| used_price | 3188.000 | 104.383 | 103.351 | 2.510 | 44.703 | 74.965 | 124.555 | 1049.970 |
| brand_name_encode | 3188.000 | 0.056 | 0.042 | 0.003 | 0.031 | 0.036 | 0.074 | 0.143 |
# function to treat outliers by clipping them to the whiskers
def treat_outliers(df, col):
    """
    Treats outliers in a numerical variable by clipping.
    df: dataframe
    col: str, name of the numerical column
    """
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # a wider 3*IQR fence could be used instead for a more lenient treatment
    # all values smaller than Lower_Whisker are set to Lower_Whisker,
    # all values greater than Upper_Whisker are set to Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df
# helper to treat outliers in all numerical variables at once (not used below)
def treat_outliers_all(df, col_list):
    """
    Treat outliers in all numerical variables.
    df: dataframe
    col_list: list of numerical column names
    """
    for c in col_list:
        df = treat_outliers(df, c)
    return df
df.columns
Index(['brand_name', 'os', 'screen_size', 'four_g', 'five_g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price',
'brand_name_encode'],
dtype='object')
# BOXPLOTS before outlier treatment
# let's plot the boxplots of all numeric columns to check for outliers
numeric_columns = ['screen_size', 'main_camera_mp',
                   'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
                   'days_used', 'new_price', 'used_price']
plt.figure(figsize=(15, 5))
for i, variable in enumerate(numeric_columns):
    plt.subplot(2, 5, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
# treating the outliers in selected columns and
# plotting their distributions after treatment
cols_to_ol = ['screen_size', 'new_price', 'used_price', 'battery', 'weight']
for colname in cols_to_ol:
    df = treat_outliers(df, colname)  # clip values beyond the whiskers
    sns.histplot(df[colname], bins=50, kde=True)
    plt.show()
# BOXPLOTS after outlier treatment
# let's plot the boxplots of all columns to check for outliers again
numeric_columns = ['screen_size', 'new_price', 'used_price', 'main_camera_mp',
                   'selfie_camera_mp', 'int_memory', 'ram', 'battery',
                   'weight', 'days_used']
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.title(variable)
plt.tight_layout()
plt.show()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3188 entries, 0 to 3570
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   brand_name         3188 non-null   category
 1   os                 3188 non-null   category
 2   screen_size        3188 non-null   float64
 3   four_g             3188 non-null   category
 4   five_g             3188 non-null   category
 5   main_camera_mp     3188 non-null   float64
 6   selfie_camera_mp   3188 non-null   float64
 7   int_memory         3188 non-null   float64
 8   ram                3188 non-null   float64
 9   battery            3188 non-null   float64
 10  weight             3188 non-null   float64
 11  release_year       3188 non-null   int64
 12  days_used          3188 non-null   int64
 13  new_price          3188 non-null   float64
 14  used_price         3188 non-null   float64
 15  brand_name_encode  3188 non-null   float64
dtypes: category(4), float64(10), int64(2)
memory usage: 338.1 KB
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| screen_size | 3188.000 | 5.413 | 1.521 | 1.905 | 4.440 | 5.160 | 6.130 | 8.665 |
| main_camera_mp | 3188.000 | 9.642 | 4.707 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3188.000 | 6.793 | 6.985 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3188.000 | 54.199 | 81.582 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3188.000 | 4.014 | 1.231 | 0.030 | 4.000 | 4.000 | 4.000 | 8.000 |
| battery | 3188.000 | 2834.641 | 1054.200 | 80.000 | 2000.000 | 2900.000 | 3500.000 | 5750.000 |
| weight | 3188.000 | 155.617 | 30.321 | 86.250 | 138.900 | 155.000 | 174.000 | 226.650 |
| release_year | 3188.000 | 2015.989 | 2.273 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3188.000 | 675.866 | 248.786 | 91.000 | 535.000 | 691.000 | 872.000 | 1094.000 |
| new_price | 3188.000 | 214.928 | 130.306 | 9.130 | 118.998 | 181.100 | 280.955 | 523.891 |
| used_price | 3188.000 | 93.226 | 64.987 | 2.510 | 44.703 | 74.965 | 124.555 | 244.334 |
| brand_name_encode | 3188.000 | 0.056 | 0.042 | 0.003 | 0.031 | 0.036 | 0.074 | 0.143 |
from sklearn.preprocessing import StandardScaler, MinMaxScaler
df.columns
Index(['brand_name', 'os', 'screen_size', 'four_g', 'five_g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price',
'brand_name_encode'],
dtype='object')
df_cell=df.copy()
df_cell
| brand_name | os | screen_size | four_g | five_g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | brand_name_encode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 8.665 | yes | no | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | 86.960 | 0.033 |
| 1 | Honor | Android | 8.665 | yes | yes | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 161.490 | 0.033 |
| 2 | Honor | Android | 8.665 | yes | yes | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 244.334 | 0.033 |
| 4 | Honor | Android | 6.190 | yes | no | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | 103.800 | 0.033 |
| 5 | Honor | Android | 8.440 | yes | no | 13.000 | 8.000 | 64.000 | 4.000 | 4000.000 | 176.000 | 2020 | 223 | 157.700 | 113.670 | 0.033 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3566 | Asus | Android | 6.250 | yes | no | 8.000 | 8.000 | 64.000 | 6.000 | 5000.000 | 190.000 | 2019 | 232 | 523.891 | 244.334 | 0.035 |
| 3567 | Asus | Android | 6.000 | yes | no | 13.000 | 8.000 | 128.000 | 8.000 | 4000.000 | 200.000 | 2018 | 541 | 518.814 | 244.334 | 0.035 |
| 3568 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 3.000 | 4000.000 | 165.000 | 2020 | 201 | 92.650 | 69.810 | 0.035 |
| 3569 | Alcatel | Android | 7.370 | yes | no | 13.000 | 5.000 | 32.000 | 2.000 | 4000.000 | 160.000 | 2020 | 149 | 101.915 | 76.070 | 0.035 |
| 3570 | Alcatel | Android | 5.310 | yes | no | 13.000 | 5.000 | 16.000 | 2.000 | 4000.000 | 168.000 | 2020 | 176 | 72.241 | 50.480 | 0.035 |
3188 rows × 16 columns
df.columns
Index(['brand_name', 'os', 'screen_size', 'four_g', 'five_g', 'main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'release_year', 'days_used', 'new_price', 'used_price',
'brand_name_encode'],
dtype='object')
# one-hot encoding the categorical variables
df = pd.get_dummies(df, columns=['four_g'])
df = pd.get_dummies(df, columns=['five_g'])
df = pd.get_dummies(df, columns=['os'])
print(df.sample(10))
brand_name screen_size main_camera_mp selfie_camera_mp int_memory \
2084 Motorola 4.500 8.000 2.000 16.000
217 ZTE 8.630 14.500 8.000 128.000
3223 ZTE 5.000 16.000 8.000 64.000
3314 Huawei 8.665 10.500 16.000 128.000
2557 Samsung 5.310 13.000 5.000 16.000
3355 Motorola 6.190 13.000 25.000 128.000
1075 Gionee 7.810 8.000 5.000 16.000
3370 Oppo 6.310 8.000 8.000 64.000
1976 Micromax 2.190 0.300 0.300 16.000
1090 Gionee 5.310 13.000 5.000 32.000
ram battery weight release_year days_used ... used_price \
2084 4.000 320.000 86.250 2014 902 ... 56.890
217 6.000 4000.000 186.000 2019 330 ... 244.334
3223 4.000 2800.000 138.000 2016 649 ... 63.680
3314 8.000 4000.000 178.000 2020 221 ... 244.334
2557 4.000 3300.000 172.000 2015 1079 ... 90.190
3355 4.000 3500.000 180.000 2019 320 ... 108.500
1075 4.000 2670.000 135.600 2018 645 ... 52.560
3370 3.000 5000.000 195.000 2019 508 ... 63.310
1976 4.000 1800.000 118.000 2013 868 ... 5.550
1090 4.000 3065.000 183.000 2016 819 ... 57.240
brand_name_encode four_g_no four_g_yes five_g_no five_g_yes \
2084 0.031 1 0 1 0
217 0.039 0 1 1 0
3223 0.039 0 1 1 0
3314 0.074 0 1 0 1
2557 0.102 0 1 1 0
3355 0.031 0 1 1 0
1075 0.016 0 1 1 0
3370 0.036 0 1 1 0
1976 0.034 1 0 1 0
1090 0.016 0 1 1 0
os_Android os_Others os_Windows os_iOS
2084 1 0 0 0
217 1 0 0 0
3223 1 0 0 0
3314 1 0 0 0
2557 1 0 0 0
3355 1 0 0 0
1075 1 0 0 0
3370 1 0 0 0
1976 0 1 0 0
1090 1 0 0 0
[10 rows x 21 columns]
df.drop(['brand_name', 'os_Windows','four_g_no','five_g_yes'],axis=1,inplace=True)
df
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | brand_name_encode | four_g_yes | five_g_no | os_Android | os_Others | os_iOS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.665 | 13.000 | 5.000 | 64.000 | 3.000 | 3020.000 | 146.000 | 2020 | 127 | 111.620 | 86.960 | 0.033 | 1 | 1 | 1 | 0 | 0 |
| 1 | 8.665 | 13.000 | 16.000 | 128.000 | 8.000 | 4300.000 | 213.000 | 2020 | 325 | 249.390 | 161.490 | 0.033 | 1 | 0 | 1 | 0 | 0 |
| 2 | 8.665 | 13.000 | 8.000 | 128.000 | 8.000 | 4200.000 | 213.000 | 2020 | 162 | 359.470 | 244.334 | 0.033 | 1 | 0 | 1 | 0 | 0 |
| 4 | 6.190 | 13.000 | 8.000 | 64.000 | 3.000 | 5000.000 | 185.000 | 2020 | 293 | 140.870 | 103.800 | 0.033 | 1 | 1 | 1 | 0 | 0 |
| 5 | 8.440 | 13.000 | 8.000 | 64.000 | 4.000 | 4000.000 | 176.000 | 2020 | 223 | 157.700 | 113.670 | 0.033 | 1 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3566 | 6.250 | 8.000 | 8.000 | 64.000 | 6.000 | 5000.000 | 190.000 | 2019 | 232 | 523.891 | 244.334 | 0.035 | 1 | 1 | 1 | 0 | 0 |
| 3567 | 6.000 | 13.000 | 8.000 | 128.000 | 8.000 | 4000.000 | 200.000 | 2018 | 541 | 518.814 | 244.334 | 0.035 | 1 | 1 | 1 | 0 | 0 |
| 3568 | 7.370 | 13.000 | 5.000 | 32.000 | 3.000 | 4000.000 | 165.000 | 2020 | 201 | 92.650 | 69.810 | 0.035 | 1 | 1 | 1 | 0 | 0 |
| 3569 | 7.370 | 13.000 | 5.000 | 32.000 | 2.000 | 4000.000 | 160.000 | 2020 | 149 | 101.915 | 76.070 | 0.035 | 1 | 1 | 1 | 0 | 0 |
| 3570 | 5.310 | 13.000 | 5.000 | 16.000 | 2.000 | 4000.000 | 168.000 | 2020 | 176 | 72.241 | 50.480 | 0.035 | 1 | 1 | 1 | 0 | 0 |
3188 rows × 17 columns
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3188 entries, 0 to 3570
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   screen_size        3188 non-null   float64
 1   main_camera_mp     3188 non-null   float64
 2   selfie_camera_mp   3188 non-null   float64
 3   int_memory         3188 non-null   float64
 4   ram                3188 non-null   float64
 5   battery            3188 non-null   float64
 6   weight             3188 non-null   float64
 7   release_year       3188 non-null   int64
 8   days_used          3188 non-null   int64
 9   new_price          3188 non-null   float64
 10  used_price         3188 non-null   float64
 11  brand_name_encode  3188 non-null   float64
 12  four_g_yes         3188 non-null   uint8
 13  five_g_no          3188 non-null   uint8
 14  os_Android         3188 non-null   uint8
 15  os_Others          3188 non-null   uint8
 16  os_iOS             3188 non-null   uint8
dtypes: float64(10), int64(2), uint8(5)
memory usage: 339.3 KB
df.columns
Index(['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory',
'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price',
'used_price', 'brand_name_encode', 'four_g_yes', 'five_g_no',
'os_Android', 'os_Others', 'os_iOS'],
dtype='object')
# pairplot of numeric variables
sns.pairplot(df.iloc[:, 0:11], diag_kind='kde') # to plot density curve instead of histogram on the diag
<seaborn.axisgrid.PairGrid at 0x164c95f9af0>
# defining X and y variables
X = df.drop(["used_price"], axis=1)
y = df["used_price"]
print(X.head())
print(y.head())
   screen_size  main_camera_mp  selfie_camera_mp  int_memory   ram  battery  \
0        8.665          13.000             5.000      64.000 3.000 3020.000
1        8.665          13.000            16.000     128.000 8.000 4300.000
2        8.665          13.000             8.000     128.000 8.000 4200.000
4        6.190          13.000             8.000      64.000 3.000 5000.000
5        8.440          13.000             8.000      64.000 4.000 4000.000

    weight  release_year  days_used  new_price  brand_name_encode  four_g_yes  \
0  146.000          2020        127    111.620              0.033           1
1  213.000          2020        325    249.390              0.033           1
2  213.000          2020        162    359.470              0.033           1
4  185.000          2020        293    140.870              0.033           1
5  176.000          2020        223    157.700              0.033           1

   five_g_no  os_Android  os_Others  os_iOS
0          1           1          0       0
1          0           1          0       0
2          0           1          0       0
4          1           1          0       0
5          1           1          0       0

0     86.960
1    161.490
2    244.334
4    103.800
5    113.670
Name: used_price, dtype: float64
X.columns
Index(['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory',
'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price',
'brand_name_encode', 'four_g_yes', 'five_g_no', 'os_Android',
'os_Others', 'os_iOS'],
dtype='object')
# Split X and y into training and test sets in a 70:30 ratio
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 2231
Number of rows in test data = 957
# data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler

# fit scaler on training data only to avoid data leakage
norm = MinMaxScaler().fit(X_train)
# transform training data
X_train_norm = norm.transform(X_train)
# transform test data
X_test_norm = norm.transform(X_test)
import pylab
import scipy.stats as stats
cols_to_mm = ['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory',
              'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price']
for colname in cols_to_mm:
    print("probability plot of", colname)
    stats.probplot(df[colname], dist="norm", plot=pylab)
    plt.show()
probability plot of screen_size
probability plot of main_camera_mp
probability plot of selfie_camera_mp
probability plot of int_memory
probability plot of ram
probability plot of battery
probability plot of weight
probability plot of release_year
probability plot of days_used
probability plot of new_price
X_train.describe()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | brand_name_encode | four_g_yes | five_g_no | os_Android | os_Others | os_iOS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 | 2231.000 |
| mean | 5.397 | 9.725 | 6.807 | 55.400 | 3.990 | 2835.737 | 155.549 | 2015.995 | 673.528 | 214.603 | 0.056 | 0.682 | 0.962 | 0.908 | 0.064 | 0.009 |
| std | 1.510 | 4.765 | 6.906 | 85.344 | 1.237 | 1055.838 | 30.107 | 2.267 | 246.760 | 129.480 | 0.042 | 0.466 | 0.191 | 0.289 | 0.244 | 0.097 |
| min | 1.905 | 0.080 | 0.300 | 0.005 | 0.030 | 80.000 | 86.250 | 2013.000 | 91.000 | 9.630 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 25% | 4.440 | 8.000 | 2.000 | 16.000 | 4.000 | 2000.000 | 138.900 | 2014.000 | 540.000 | 118.996 | 0.031 | 0.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| 50% | 5.190 | 8.000 | 5.000 | 32.000 | 4.000 | 2930.000 | 155.000 | 2016.000 | 686.000 | 181.340 | 0.036 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| 75% | 6.130 | 13.000 | 8.000 | 64.000 | 4.000 | 3500.000 | 174.000 | 2018.000 | 863.000 | 280.845 | 0.074 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 |
| max | 8.665 | 48.000 | 32.000 | 1024.000 | 8.000 | 5750.000 | 226.650 | 2020.000 | 1094.000 | 523.891 | 0.143 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
# data standardization with sklearn
from sklearn.preprocessing import StandardScaler
# copy of datasets
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()
# numerical features
num_cols = ['screen_size','main_camera_mp',
'selfie_camera_mp', 'int_memory', 'ram', 'battery', 'weight',
'days_used', 'new_price']
# apply standardization on numerical features
for i in num_cols:
    # fit on the training data column only
    scale = StandardScaler().fit(X_train_stand[[i]])
    # transform the training data column
    X_train_stand[i] = scale.transform(X_train_stand[[i]])
    # transform the test data column
    X_test_stand[i] = scale.transform(X_test_stand[[i]])
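As a sanity check of per-column standardization (toy numbers, not the phone data): after fitting on the training column, the transformed training column has mean 0 and standard deviation 1, while the test column reuses the same training mean and standard deviation.

```python
import numpy as np

train_col = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
mu, sigma = train_col.mean(), train_col.std()  # fitted on train only

train_std = (train_col - mu) / sigma
test_col = np.array([6.0, 0.0])
test_std = (test_col - mu) / sigma  # same mu and sigma as the train column

print(round(float(train_std.mean()), 6), round(float(train_std.std()), 6))
```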
# fit a linear regression model
from sklearn.linear_model import LinearRegression

regression_model = LinearRegression()
regression_model.fit(X_train, y_train)
LinearRegression()
coef_df = pd.DataFrame(
np.append(regression_model.coef_, regression_model.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| screen_size | 0.473 |
| main_camera_mp | -0.355 |
| selfie_camera_mp | 0.695 |
| int_memory | 0.021 |
| ram | 2.258 |
| battery | -0.001 |
| weight | 0.028 |
| release_year | 0.003 |
| days_used | -0.085 |
| new_price | 0.391 |
| brand_name_encode | 0.905 |
| four_g_yes | -2.226 |
| five_g_no | 1.816 |
| os_Android | 1.670 |
| os_Others | -0.120 |
| os_iOS | 12.772 |
| Intercept | 42.629 |
Let's check the performance of the model using different metrics.
We will use the metric functions defined in sklearn for RMSE, MAE, and $R^2$.
We will define our own functions for MAPE and adjusted $R^2$.
The mean absolute percentage error (MAPE) measures prediction accuracy as a percentage: the average of |actual − predicted| / actual over all observations. It works best when there are no extreme values in the data and none of the actual values are 0. We will create a function that prints all the above metrics in one go.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
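A quick hand check of both formulas on made-up numbers: with a hypothetical n = 1000 rows, k = 16 predictors, and $R^2$ = 0.95, the adjustment barely moves the score, and the MAPE of a small prediction vector can be verified by hand.

```python
import numpy as np

# adjusted R^2 for hypothetical n = 1000 rows and k = 16 predictors
r2, n, k = 0.95, 1000, 16
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 4))  # 0.9492 -- only slightly below the raw R^2

# MAPE on toy targets/predictions: errors of 10%, 10%, and 5% average to 8.33%
targets = np.array([100.0, 50.0, 200.0])
preds = np.array([110.0, 45.0, 190.0])
mape = np.mean(np.abs(targets - preds) / targets) * 100
print(round(mape, 2))  # 8.33
```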
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance
    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
# checking model performance on the training set (the 70% of data the model has seen)
print("Training Performance\n")
linearregression_train_perf = model_performance_regression(
regression_model, X_train, y_train
)
linearregression_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 13.430 | 9.972 | 0.956 | 0.956 | 18.639 |
# checking model performance on the test set (the unseen 30% of data)
print("Test Performance\n")
linearregression_test_perf = model_performance_regression(
regression_model, X_test, y_test
)
linearregression_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 14.057 | 10.285 | 0.956 | 0.955 | 20.586 |
The training $R^2$ is 95.6%, indicating that the model explains 95.6% of the variation in the training data, so the model is not underfitting.
MAE and RMSE on the train and test sets are comparable, which shows that the model is not overfitting.
MAE indicates that the model can predict used phone prices within a mean error of ~10 euros on the test data.
MAPE on the test set suggests our predictions are, on average, within ~20.6% of the actual used phone prices.
Let's build a linear regression model using statsmodels.
import statsmodels.api as sm
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
X_train1 = sm.add_constant(X_train)
# adding constant to the test data
X_test1 = sm.add_constant(X_test)
olsmod0 = sm.OLS(y_train, X_train1).fit()
print(olsmod0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.956
Model: OLS Adj. R-squared: 0.956
Method: Least Squares F-statistic: 3013.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 22:53:07 Log-Likelihood: -8960.7
No. Observations: 2231 AIC: 1.796e+04
Df Residuals: 2214 BIC: 1.805e+04
Df Model: 16
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 42.6290 540.755 0.079 0.937 -1017.812 1103.070
screen_size 0.4733 0.336 1.407 0.160 -0.186 1.133
main_camera_mp -0.3554 0.080 -4.448 0.000 -0.512 -0.199
selfie_camera_mp 0.6949 0.068 10.252 0.000 0.562 0.828
int_memory 0.0214 0.004 5.794 0.000 0.014 0.029
ram 2.2577 0.322 7.019 0.000 1.627 2.889
battery -0.0006 0.001 -1.135 0.257 -0.002 0.000
weight 0.0280 0.016 1.804 0.071 -0.002 0.058
release_year 0.0025 0.268 0.010 0.992 -0.523 0.528
days_used -0.0851 0.002 -46.752 0.000 -0.089 -0.081
new_price 0.3910 0.003 119.062 0.000 0.385 0.397
brand_name_encode 0.9055 6.918 0.131 0.896 -12.661 14.472
four_g_yes -2.2262 0.921 -2.417 0.016 -4.033 -0.420
five_g_no 1.8156 1.925 0.943 0.346 -1.959 5.590
os_Android 1.6699 2.126 0.785 0.432 -2.499 5.839
os_Others -0.1199 2.515 -0.048 0.962 -5.053 4.813
os_iOS 12.7716 3.812 3.351 0.001 5.296 20.247
==============================================================================
Omnibus: 204.310 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 421.000
Skew: 0.586 Prob(JB): 3.81e-92
Kurtosis: 4.776 Cond. No. 6.92e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.92e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
A negative coefficient indicates that used_price decreases as the corresponding attribute value increases.
A positive coefficient indicates that used_price increases as the corresponding attribute value increases.
The p-value of a variable indicates whether the variable is statistically significant. If we take the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 is considered significant.
However, these variables might exhibit multicollinearity, which affects the p-values.
So, we need to deal with multicollinearity and check the other assumptions of linear regression first, and only then interpret the p-values.
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors.
General rule of thumb: a VIF of 1 indicates no correlation with the other predictors, a VIF between 1 and 5 indicates moderate correlation, and a VIF greater than 5 indicates high correlation that may need treatment.
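The VIF for a predictor equals 1 / (1 − R²) from regressing that predictor on all the others. A minimal sketch on simulated data (two deliberately correlated predictors, not the phone dataset) shows how the correlation inflates the VIF:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.5, size=500)  # x2 deliberately tracks x1

# regress x2 on x1 (with intercept) by ordinary least squares
A = np.column_stack([np.ones_like(x1), x1])
coef, *_ = np.linalg.lstsq(A, x2, rcond=None)
resid = x2 - A @ coef
r2 = 1 - resid.var() / x2.var()

vif = 1 / (1 - r2)
print(round(vif, 2))  # around 5: strong collinearity between x1 and x2
```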
df.columns
Index(['screen_size', 'main_camera_mp', 'selfie_camera_mp', 'int_memory',
'ram', 'battery', 'weight', 'release_year', 'days_used', 'new_price',
'used_price', 'brand_name_encode', 'four_g_yes', 'five_g_no',
'os_Android', 'os_Others', 'os_iOS'],
dtype='object')
from statsmodels.stats.outliers_influence import variance_inflation_factor

# we will define a function to check VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(predictors.values, i)
        for i in range(len(predictors.columns))
    ]
    return vif
checking_vif(X_train1)
| feature | VIF | |
|---|---|---|
| 0 | const | 3589223.383 |
| 1 | screen_size | 3.166 |
| 2 | main_camera_mp | 1.779 |
| 3 | selfie_camera_mp | 2.689 |
| 4 | int_memory | 1.224 |
| 5 | ram | 1.944 |
| 6 | battery | 3.968 |
| 7 | weight | 2.682 |
| 8 | release_year | 4.537 |
| 9 | days_used | 2.473 |
| 10 | new_price | 2.218 |
| 11 | brand_name_encode | 1.050 |
| 12 | four_g_yes | 2.258 |
| 13 | five_g_no | 1.666 |
| 14 | os_Android | 4.629 |
| 15 | os_Others | 4.629 |
| 16 | os_iOS | 1.663 |
release_year has the highest VIF among the predictors (close to 5), but it is a temporal variable. The os_Android and os_Others dummies also have relatively high VIFs because they are correlated with each other.
To remove multicollinearity
# function to treat multicollinearity
def treating_multicollinearity(predictors, target, high_vif_columns):
    """
    Checking the effect of dropping the columns showing high multicollinearity
    on model performance (adj. R-squared and RMSE)
    predictors: independent variables
    target: dependent variable
    high_vif_columns: columns having high VIF
    """
    # empty lists to store adj. R-squared and RMSE values
    adj_r2 = []
    rmse = []

    # build OLS models by dropping one of the high-VIF columns at a time
    # store the adjusted R-squared and RMSE in the lists defined above
    for cols in high_vif_columns:
        # defining the new train set
        train = predictors.loc[:, ~predictors.columns.str.startswith(cols)]
        # create the model
        olsmodel = sm.OLS(target, train).fit()
        # adding adj. R-squared and RMSE to the lists
        adj_r2.append(olsmodel.rsquared_adj)
        rmse.append(np.sqrt(olsmodel.mse_resid))

    # creating a dataframe for the results
    temp = pd.DataFrame(
        {
            "col": high_vif_columns,
            "Adj. R-squared after_dropping col": adj_r2,
            "RMSE after dropping col": rmse,
        }
    ).sort_values(by="Adj. R-squared after_dropping col", ascending=False)
    temp.reset_index(drop=True, inplace=True)
    return temp
# check adjusted R^2 and RMSE after dropping each of the high-VIF variables
col_list = ["release_year", "screen_size", "main_camera_mp", "battery", "weight"]
res = treating_multicollinearity(X_train1, y_train, col_list)
res
| | col | Adj. R-squared after_dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | release_year | 0.956 | 13.479 |
| 1 | battery | 0.956 | 13.483 |
| 2 | screen_size | 0.956 | 13.485 |
| 3 | weight | 0.956 | 13.489 |
| 4 | main_camera_mp | 0.955 | 13.539 |
Dropping any one of these columns barely changes the adjusted R-squared or RMSE, so we can remove them to reduce multicollinearity without hurting model performance.
Let's check the VIFs after dropping them.
# dropping the high-VIF columns identified above and checking VIF again
col_to_drop = ("release_year", "screen_size", "main_camera_mp", "battery", "weight")
X_train2 = X_train1.loc[:, ~X_train1.columns.str.startswith(col_to_drop)]
X_test2 = X_test1.loc[:, ~X_test1.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(X_train2)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping ('release_year', 'screen_size', 'main_camera_mp', 'battery', 'weight')
| | feature | VIF |
|---|---|---|
| 0 | const | 149.835 |
| 1 | selfie_camera_mp | 2.133 |
| 2 | int_memory | 1.196 |
| 3 | ram | 1.912 |
| 4 | days_used | 1.570 |
| 5 | new_price | 1.789 |
| 6 | brand_name_encode | 1.035 |
| 7 | four_g_yes | 1.553 |
| 8 | five_g_no | 1.575 |
| 9 | os_Android | 4.599 |
| 10 | os_Others | 4.329 |
| 11 | os_iOS | 1.574 |
All VIF values (other than the constant, which can be ignored) are now below 5, so multicollinearity is no longer a concern. Let's refit the model on the reduced set of predictors.
# fitting model again after dropping required variables
olsmod1 = sm.OLS(y_train, X_train2).fit()
print(olsmod1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.956
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 4338.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 22:53:07 Log-Likelihood: -8974.0
No. Observations: 2231 AIC: 1.797e+04
Df Residuals: 2219 BIC: 1.804e+04
Df Model: 11
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
const 53.3645 3.511 15.200 0.000 46.480 60.249
selfie_camera_mp 0.6665 0.061 10.985 0.000 0.547 0.785
int_memory 0.0242 0.004 6.590 0.000 0.017 0.031
ram 2.3793 0.321 7.423 0.000 1.751 3.008
days_used -0.0867 0.001 -59.541 0.000 -0.090 -0.084
new_price 0.3871 0.003 130.612 0.000 0.381 0.393
brand_name_encode 1.9669 6.902 0.285 0.776 -11.568 15.502
four_g_yes -3.1960 0.768 -4.163 0.000 -4.702 -1.691
five_g_no -0.0810 1.880 -0.043 0.966 -3.768 3.606
os_Android 1.7400 2.129 0.817 0.414 -2.435 5.915
os_Others -0.2713 2.444 -0.111 0.912 -5.065 4.522
os_iOS 13.3889 3.727 3.593 0.000 6.080 20.697
==============================================================================
Omnibus: 208.385 Durbin-Watson: 1.986
Prob(Omnibus): 0.000 Jarque-Bera (JB): 413.196
Skew: 0.608 Prob(JB): 1.89e-90
Kurtosis: 4.722 Cond. No. 1.80e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.8e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations

brand_name_encode, five_g_no, os_Android, and os_Others have p-values greater than 0.05, so they are not statistically significant at the 5% level. Let's check the effect of dropping them one at a time.
# check Adjusted R^2 and RMSE after dropping variables with p>0.05
col_list = ['five_g_no','os_Others','os_Android' ,'brand_name_encode']
res = treating_multicollinearity(X_train2, y_train, col_list)
res
| | col | Adj. R-squared after_dropping col | RMSE after dropping col |
|---|---|---|---|
| 0 | five_g_no | 0.955 | 13.544 |
| 1 | os_Others | 0.955 | 13.544 |
| 2 | brand_name_encode | 0.955 | 13.544 |
| 3 | os_Android | 0.955 | 13.546 |
Dropping any of these statistically insignificant predictors has a negligible effect on the adjusted R-squared and RMSE, so we can remove them.
Let's check the VIFs and model performance after dropping them.
# dropping the insignificant variables (p > 0.05) and checking VIF again
col_to_drop = ("five_g_no", "os_Others", "os_Android", "brand_name_encode")
X_train3 = X_train2.loc[:, ~X_train2.columns.str.startswith(col_to_drop)]
X_test3 = X_test2.loc[:, ~X_test2.columns.str.startswith(col_to_drop)]
# Check VIF now
vif = checking_vif(X_train3)
print("VIF after dropping ", col_to_drop)
vif
VIF after dropping ('five_g_no', 'os_Others', 'os_Android', 'brand_name_encode')
| | feature | VIF |
|---|---|---|
| 0 | const | 31.207 |
| 1 | selfie_camera_mp | 2.113 |
| 2 | int_memory | 1.147 |
| 3 | ram | 1.492 |
| 4 | days_used | 1.488 |
| 5 | new_price | 1.737 |
| 6 | four_g_yes | 1.464 |
| 7 | os_iOS | 1.077 |
# initial list of columns
cols = X_train3.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train3[cols]

    # fitting the model
    model = sm.OLS(y_train, x_train_aux).fit()

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'selfie_camera_mp', 'int_memory', 'ram', 'days_used', 'new_price', 'four_g_yes', 'os_iOS']
# train data with only selected columns
x_train4 = X_train3[selected_features]
x_test4 = X_test3[selected_features]
# fitting model on train data after dropping required columns
olsmod2 = sm.OLS(y_train, x_train4).fit()
print(olsmod2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 6819.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 22:53:07 Log-Likelihood: -8975.6
No. Observations: 2231 AIC: 1.797e+04
Df Residuals: 2223 BIC: 1.801e+04
Df Model: 7
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 54.3224 1.602 33.911 0.000 51.181 57.464
selfie_camera_mp 0.6633 0.060 10.987 0.000 0.545 0.782
int_memory 0.0242 0.004 6.726 0.000 0.017 0.031
ram 2.4671 0.283 8.714 0.000 1.912 3.022
days_used -0.0866 0.001 -61.091 0.000 -0.089 -0.084
new_price 0.3875 0.003 132.740 0.000 0.382 0.393
four_g_yes -2.9494 0.745 -3.958 0.000 -4.411 -1.488
os_iOS 11.5997 3.082 3.763 0.000 5.555 17.644
==============================================================================
Omnibus: 204.467 Durbin-Watson: 1.988
Prob(Omnibus): 0.000 Jarque-Bera (JB): 414.705
Skew: 0.591 Prob(JB): 8.87e-91
Kurtosis: 4.750 Cond. No. 8.02e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Now no feature has a p-value greater than 0.05, so we'll consider the features in x_train4 as the final set and olsmod2 as the final model.
Observations
Now we'll check the rest of the assumptions on olsmod2:
• Linearity of variables & independence of error terms
• Normality of error terms
• No heteroscedasticity
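The independence of error terms is summarized by the Durbin-Watson statistic already reported in the model summary (about 1.99 here, close to the ideal value of 2; values well below 2 indicate positive autocorrelation). A self-contained illustration on synthetic residuals (the data are illustrative):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(42)

# independent residuals: Durbin-Watson should come out close to 2
indep = rng.normal(size=1000)
dw_indep = durbin_watson(indep)

# positively autocorrelated residuals (AR(1) with rho = 0.8):
# Durbin-Watson is roughly 2 * (1 - rho), i.e. well below 2
auto = np.empty(1000)
auto[0] = rng.normal()
for t in range(1, 1000):
    auto[t] = 0.8 * auto[t - 1] + rng.normal()
dw_auto = durbin_watson(auto)

print(round(dw_indep, 2), round(dw_auto, 2))
```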
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod2.fittedvalues # predicted values
df_pred["Residuals"] = olsmod2.resid # residuals
df_pred.head()
| | Actual Values | Fitted Values | Residuals |
|---|---|---|---|
| 776 | 103.940 | 94.076 | 9.864 |
| 1451 | 60.250 | 55.093 | 5.157 |
| 1812 | 65.690 | 60.086 | 5.604 |
| 2970 | 232.260 | 218.623 | 13.637 |
| 2741 | 132.110 | 156.819 | -24.709 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="green", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
# overlaying a fitted straight line on the residuals for reference
plt.plot(df_pred["Fitted Values"], df_pred["Residuals"], 'o', color='green')

# obtain m (slope) and b (intercept) of the linear regression line
m, b = np.polyfit(df_pred["Fitted Values"], df_pred["Residuals"], 1)

# use red as the color for the regression line
plt.plot(df_pred["Fitted Values"], m * df_pred["Fitted Values"] + b, color='red')
plt.show()
The scatter plot shows the distribution of residuals (errors) vs fitted values (predicted values).
The residuals show no systematic pattern against the fitted values, and the reference line is essentially flat, which supports the linearity and independence assumptions.
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import scipy.stats as stats

stats.probplot(df_pred["Residuals"], dist="norm", plot=plt)
plt.show()
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9685258865356445, pvalue=1.0687705648197343e-21)

The Shapiro-Wilk p-value is below 0.05, so the residuals are not perfectly normal. However, the histogram and Q-Q plot show only mild deviations, and with over 2,000 observations the Shapiro-Wilk test flags even tiny departures from normality, so we can treat the normality assumption as approximately satisfied.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train4)
lzip(name, test)
[('F statistic', 0.9465894859592179), ('p-value', 0.8193959059441539)]

The Goldfeld-Quandt p-value is well above 0.05, so we fail to reject the null hypothesis of homoscedasticity. The residuals have constant variance, and this assumption is satisfied.
# predictions on the test set
pred = olsmod2.predict(x_test4)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| | Actual | Predicted |
|---|---|---|
| 2720 | 84.540 | 100.380 |
| 1048 | 10.980 | 23.895 |
| 2761 | 119.710 | 133.976 |
| 2191 | 31.980 | 32.024 |
| 2527 | 35.800 | 27.008 |
| 2842 | 67.760 | 65.123 |
| 654 | 50.920 | 59.858 |
| 2268 | 29.090 | 9.554 |
| 765 | 36.120 | 40.868 |
| 1014 | 11.960 | 4.605 |
We can observe here that the model has returned good enough prediction results, and the actual and predicted values are comparable.
We can also visualize comparison result as a bar graph.
Note: As the number of records is large, for representation purpose, we are taking a sample of 25 records only.
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod2_train_perf = model_performance_regression(olsmod2, x_train4, y_train)
olsmod2_train_perf
Training Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 13.520 | 10.043 | 0.955 | 0.955 | 18.883 |
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmod2_test_perf = model_performance_regression(olsmod2, x_test4, y_test)
olsmod2_test_perf
Test Performance
| | RMSE | MAE | R-squared | Adj. R-squared | MAPE |
|---|---|---|---|---|---|
| 0 | 14.080 | 10.304 | 0.956 | 0.955 | 20.657 |
The model is able to explain 95.6% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable, so our model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within 20.7% of the used phone prices.
Hence, we can conclude the model olsmod2 is good for prediction as well as inference purposes.
Let's compare the initial model created with sklearn and the final statsmodels model.
# training performance comparison
models_train_comp_df = pd.concat(
[linearregression_train_perf.T, olsmod2_train_perf.T],
axis=1,
)
models_train_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 13.430 | 13.520 |
| MAE | 9.972 | 10.043 |
| R-squared | 0.956 | 0.955 |
| Adj. R-squared | 0.956 | 0.955 |
| MAPE | 18.639 | 18.883 |
olsmodel_final = sm.OLS(y_train, x_train4).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price R-squared: 0.955
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 6819.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 22:53:09 Log-Likelihood: -8975.6
No. Observations: 2231 AIC: 1.797e+04
Df Residuals: 2223 BIC: 1.801e+04
Df Model: 7
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
const 54.3224 1.602 33.911 0.000 51.181 57.464
selfie_camera_mp 0.6633 0.060 10.987 0.000 0.545 0.782
int_memory 0.0242 0.004 6.726 0.000 0.017 0.031
ram 2.4671 0.283 8.714 0.000 1.912 3.022
days_used -0.0866 0.001 -61.091 0.000 -0.089 -0.084
new_price 0.3875 0.003 132.740 0.000 0.382 0.393
four_g_yes -2.9494 0.745 -3.958 0.000 -4.411 -1.488
os_iOS 11.5997 3.082 3.763 0.000 5.555 17.644
==============================================================================
Omnibus: 204.467 Durbin-Watson: 1.988
Prob(Omnibus): 0.000 Jarque-Bera (JB): 414.705
Skew: 0.591 Prob(JB): 8.87e-91
Kurtosis: 4.750 Cond. No. 8.02e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.02e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
The model is able to explain 95.6% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable, so our model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within 20.7% of the used phone prices.
Hence, we can conclude the model olsmod2 is good for prediction as well as inference purposes.
Comparing the initial model created with sklearn and the final statsmodels model, the statsmodels model shows a marginal increase in RMSE, MAE, and MAPE, which is expected since it uses fewer predictors, while R-squared and adjusted R-squared remain essentially the same.
All the retained predictors have p-values below 0.05, so we reject the null hypothesis that their coefficients are zero; each of them is statistically significant.